(54) Title: VARIABLE RATE SPEECH CODING
(57) Abstract 

A method and apparatus for the variable rate coding of a speech 
signal. An input speech signal is classified and an appropriate coding 
mode is selected based on this classification. For each classification, 
the coding mode that achieves the lowest bit rate with an acceptable 
quality of speech reproduction is selected. Low average bit rates are
achieved by only employing high fidelity modes (i.e., high bit rate,
broadly applicable to different types of speech) during portions of the
speech where this fidelity is required for acceptable output. Lower
bit rate modes are used during portions of speech where these modes
produce acceptable output. The input speech signal is classified into active
and inactive regions. Active regions are further classified into voiced,
unvoiced, and transient regions. Various coding modes are applied to
active speech, depending upon the required level of fidelity. Coding 
modes may be utilized according to the strengths and weaknesses of 
each particular mode. The apparatus dynamically switches between 
these modes as the properties of the speech signal vary with time. And 
where appropriate, regions of speech are modeled as pseudo-random 
noise, resulting in a significantly lower bit rate. This coding is used in 
a dynamic fashion whenever unvoiced speech or background noise is 
detected. 
VARIABLE RATE SPEECH CODING

BACKGROUND OF THE INVENTION 

I. Field of the Invention 

The present invention relates to the coding of speech signals. Specifically, the
present invention relates to classifying speech signals and employing one of a plurality of
coding modes based on the classification.

II. Description of the Related Art

Many communication systems today transmit voice as a digital signal, particularly
long distance and digital radio telephone applications. The performance of these systems
depends, in part, on accurately representing the voice signal with a minimum number of
bits. Transmitting speech simply by sampling and digitizing requires a data rate on the
order of 64 kilobits per second (kbps) to achieve the speech quality of a conventional
analog telephone. However, coding techniques are available that significantly reduce the
data rate required for satisfactory speech reproduction.

The term "vocoder" typically refers to devices that compress voiced speech by
extracting parameters based on a model of human speech generation. Vocoders include an
encoder and a decoder. The encoder analyzes the incoming speech and extracts the relevant
parameters. The decoder synthesizes the speech using the parameters that it receives from
the encoder via a transmission channel. The speech signal is often divided into frames of
data and block processed by the vocoder.

Vocoders built around linear-prediction-based time domain coding schemes far
exceed in number all other types of coders. These techniques extract correlated elements
from the speech signal and encode only the uncorrelated elements. The basic linear
predictive filter predicts the current sample as a linear combination of past samples. An
example of a coding algorithm of this particular class is described in the paper "A 4.8 kbps
Code Excited Linear Predictive Coder," by Thomas E. Tremain et al., Proceedings of the
Mobile Satellite Conference, 1988.

These coding schemes compress the digitized speech signal into a low bit rate signal
by removing all of the natural redundancies (i.e., correlated elements) inherent in speech.
Speech typically exhibits short term redundancies resulting from the mechanical action of
the lips and tongue, and long term redundancies resulting from the vibration of the vocal
cords. Linear predictive schemes model these operations as filters, remove the
redundancies, and then model the resulting residual signal as white Gaussian noise. Linear
predictive coders therefore achieve a reduced bit rate by transmitting filter coefficients and
quantized noise rather than a full bandwidth speech signal.

However, even these reduced bit rates often exceed the available bandwidth where
the speech signal must either propagate a long distance (e.g., ground to satellite) or coexist
with many other signals in a crowded channel. A need therefore exists for an improved
coding scheme which achieves a lower bit rate than linear predictive schemes.






SUMMARY OF THE INVENTION 



The present invention is a novel and improved method and apparatus for the
variable rate coding of a speech signal. The present invention classifies the input speech
signal and selects an appropriate coding mode based on this classification. For each
classification, the present invention selects the coding mode that achieves the lowest bit rate
with an acceptable quality of speech reproduction. The present invention achieves low
average bit rates by only employing high fidelity modes (i.e., high bit rate, broadly
applicable to different types of speech) during portions of the speech where this fidelity is
required for acceptable output. The present invention switches to lower bit rate modes
during portions of speech where these modes produce acceptable output.

An advantage of the present invention is that speech is coded at a low bit rate. Low
bit rates translate into higher capacity, greater range, and lower power requirements.

A feature of the present invention is that the input speech signal is classified into
active and inactive regions. Active regions are further classified into voiced, unvoiced, and
transient regions. The present invention therefore can apply various coding modes to
different types of active speech, depending upon the required level of fidelity.

Another feature of the present invention is that coding modes may be utilized
according to the strengths and weaknesses of each particular mode. The present invention
dynamically switches between these modes as properties of the speech signal vary with
time.

A further feature of the present invention is that, where appropriate, regions of
speech are modeled as pseudo-random noise, resulting in a significantly lower bit rate. The
present invention uses this coding in a dynamic fashion whenever unvoiced speech or
background noise is detected.

The features, objects, and advantages of the present invention will become more
apparent from the detailed description set forth below when taken in conjunction with the
drawings in which like reference numbers indicate identical or functionally similar
elements. Additionally, the left-most digit of a reference number identifies the drawing in
which the reference number first appears.



BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a diagram illustrating a signal transmission environment;
FIG. 2 is a diagram illustrating encoder 102 and decoder 104 in greater detail;
FIG. 3 is a flowchart illustrating variable rate speech coding according to the present
invention;
FIG. 4A is a diagram illustrating a frame of voiced speech split into subframes;
FIG. 4B is a diagram illustrating a frame of unvoiced speech split into subframes;
FIG. 4C is a diagram illustrating a frame of transient speech split into subframes;
FIG. 5 is a flowchart that describes the calculation of initial parameters;
FIG. 6 is a flowchart describing the classification of speech as either active or
inactive;
FIG. 7A depicts a CELP encoder;
FIG. 7B depicts a CELP decoder;
FIG. 8 depicts a pitch filter module;
FIG. 9A depicts a PPP encoder;
FIG. 9B depicts a PPP decoder;
FIG. 10 is a flowchart depicting the steps of PPP coding, including encoding and
decoding;
FIG. 11 is a flowchart describing the extraction of a prototype residual period;
FIG. 12 depicts a prototype residual period extracted from the current frame of a
residual signal, and the prototype residual period from the previous frame;
FIG. 13 is a flowchart depicting the calculation of rotational parameters;
FIG. 14 is a flowchart depicting the operation of the encoding codebook;
FIG. 15A depicts a first filter update module embodiment;
FIG. 15B depicts a first period interpolator module embodiment;
FIG. 16A depicts a second filter update module embodiment;
FIG. 16B depicts a second period interpolator module embodiment;
FIG. 17 is a flowchart describing the operation of the first filter update module
embodiment;
FIG. 18 is a flowchart describing the operation of the second filter update module
embodiment;
FIG. 19 is a flowchart describing the aligning and interpolating of prototype residual
periods;
FIG. 20 is a flowchart describing the reconstruction of a speech signal based on
prototype residual periods according to a first embodiment;
FIG. 21 is a flowchart describing the reconstruction of a speech signal based on
prototype residual periods according to a second embodiment;
FIG. 22A depicts a NELP encoder;
FIG. 22B depicts a NELP decoder; and
FIG. 23 is a flowchart describing NELP coding.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

I. Overview of the Environment
II. Overview of the Invention
III. Initial Parameter Determination
    A. Calculation of LPC Coefficients
    B. LSI Calculation
    C. NACF Calculation
    D. Pitch Track and Lag Calculation
    E. Calculation of Band Energy and Zero Crossing Rate
    F. Calculation of the Formant Residual
IV. Active/Inactive Speech Classification
    A. Hangover Frames
V. Classification of Active Speech Frames
VI. Encoder/Decoder Mode Selection
VII. Code Excited Linear Prediction (CELP) Coding Mode
    A. Pitch Encoding Module
    B. Encoding Codebook
    C. CELP Decoder
    D. Filter Update Module
VIII. Prototype Pitch Period (PPP) Coding Mode
    A. Extraction Module
    B. Rotational Correlator
    C. Encoding Codebook
    D. Filter Update Module
    E. PPP Decoder
    F. Period Interpolator
IX. Noise Excited Linear Prediction (NELP) Coding Mode
X. Conclusion

I. Overview of the Environment



The present invention is directed toward novel and improved methods and
apparatuses for variable rate speech coding. FIG. 1 depicts a signal transmission
environment 100 including an encoder 102, a decoder 104, and a transmission medium 106.
Encoder 102 encodes a speech signal s(n), forming encoded speech signal s_enc(n) for
transmission across transmission medium 106 to decoder 104. Decoder 104 decodes s_enc(n),
thereby generating synthesized speech signal ŝ(n).

The term "coding" as used herein refers generally to methods encompassing both
encoding and decoding. Generally, coding methods and apparatuses seek to minimize the
number of bits transmitted via transmission medium 106 (i.e., minimize the bandwidth of
s_enc(n)) while maintaining acceptable speech reproduction (i.e., ŝ(n) ≈ s(n)). The
composition of the encoded speech signal will vary according to the particular speech
coding method. Various encoders 102, decoders 104, and the coding methods according
to which they operate are described below.

The components of encoder 102 and decoder 104 described below may be
implemented as electronic hardware, as computer software, or combinations of both. These
components are described below in terms of their functionality. Whether the functionality
is implemented as hardware or software will depend upon the particular application and
design constraints imposed on the overall system. Skilled artisans will recognize the
interchangeability of hardware and software under these circumstances, and how best to
implement the described functionality for each particular application.

Those skilled in the art will recognize that transmission medium 106 can represent
many different transmission media, including, but not limited to, a land-based
communication line, a link between a base station and a satellite, wireless communication
between a cellular telephone and a base station, or between a cellular telephone and a
satellite.

Those skilled in the art will also recognize that often each party to a communication
transmits as well as receives. Each party would therefore require an encoder 102 and a
decoder 104. However, signal transmission environment 100 will be described below as
including encoder 102 at one end of transmission medium 106 and decoder 104 at the other.
Skilled artisans will readily recognize how to extend these ideas to two-way communication.

For purposes of this description, assume that s(n) is a digital speech signal obtained
during a typical conversation including different vocal sounds and periods of silence. The
speech signal s(n) is preferably partitioned into frames, and each frame is further partitioned
into subframes (preferably 4). These arbitrarily chosen frame/subframe boundaries are
commonly used where some block processing is performed, as is the case here. Operations
described as being performed on frames might also be performed on subframes; in this
sense, frame and subframe are used interchangeably herein. However, s(n) need not be
partitioned into frames/subframes at all if continuous processing rather than block
processing is implemented. Skilled artisans will readily recognize how the block
techniques described below might be extended to continuous processing.

In a preferred embodiment, s(n) is digitally sampled at 8 kHz. Each frame
preferably contains 20 ms of data, or 160 samples at the preferred 8 kHz rate. Each
subframe therefore contains 40 samples of data. It is important to note that many of the
equations presented below assume these values. However, those skilled in the art will
recognize that while these parameters are appropriate for speech coding, they are merely
exemplary and other suitable alternative parameters could be used.
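
The preferred framing arithmetic can be sketched as follows. This is only an illustration under the stated preferred values (8 kHz, 20 ms frames, 4 subframes); the function name `split_into_frames` and the choice to drop a trailing partial frame are assumptions of the sketch, not part of the description.

```python
SAMPLE_RATE_HZ = 8000      # preferred sampling rate from the text
FRAME_MS = 20              # preferred frame duration
SUBFRAMES_PER_FRAME = 4    # preferred number of subframes

FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000      # 160 samples
SUBFRAME_LEN = FRAME_LEN // SUBFRAMES_PER_FRAME    # 40 samples


def split_into_frames(signal):
    """Split a sample sequence into full frames, each a list of subframes.

    Trailing samples that do not fill a whole frame are dropped
    (an assumption of this sketch; the text does not say how to pad).
    """
    frames = []
    for start in range(0, len(signal) - FRAME_LEN + 1, FRAME_LEN):
        frame = signal[start:start + FRAME_LEN]
        subframes = [frame[k:k + SUBFRAME_LEN]
                     for k in range(0, FRAME_LEN, SUBFRAME_LEN)]
        frames.append(subframes)
    return frames
```

With other sampling rates or frame durations only the three constants change, which mirrors the remark that the values are exemplary.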

II. Overview of the Invention

The methods and apparatuses of the present invention involve coding the speech
signal s(n). FIG. 2 depicts encoder 102 and decoder 104 in greater detail. According to the
present invention, encoder 102 includes an initial parameter calculation module 202, a
classification module 208, and one or more encoder modes 204. Decoder 104 includes one
or more decoder modes 206. The number of decoder modes, N_d, in general equals the
number of encoder modes, N_e. As would be apparent to one skilled in the art, encoder
mode 1 communicates with decoder mode 1, and so on. As shown, the encoded speech
signal, s_enc(n), is transmitted via transmission medium 106.

In a preferred embodiment, encoder 102 dynamically switches between multiple
encoder modes from frame to frame, depending on which mode is most appropriate given
the properties of s(n) for the current frame. Decoder 104 also dynamically switches
between the corresponding decoder modes from frame to frame. A particular mode is
chosen for each frame to achieve the lowest bit rate available while maintaining acceptable
signal reproduction at the decoder. This process is referred to as variable rate speech
coding, because the bit rate of the coder changes over time (as properties of the signal
change).

FIG. 3 is a flowchart 300 that describes variable rate speech coding according to the
present invention. In step 302, initial parameter calculation module 202 calculates various
parameters based on the current frame of data. In a preferred embodiment, these parameters
include one or more of the following: linear predictive coding (LPC) filter coefficients, line
spectrum information (LSI) coefficients, the normalized autocorrelation functions (NACFs),
the open loop lag, band energies, the zero crossing rate, and the formant residual signal.

In step 304, classification module 208 classifies the current frame as containing
either "active" or "inactive" speech. As described above, s(n) is assumed to include both
periods of speech and periods of silence, common to an ordinary conversation. Active
speech includes spoken words, whereas inactive speech includes everything else, e.g.,
background noise, silence, pauses. The methods used to classify speech as active/inactive
according to the present invention are described in detail below.

As shown in FIG. 3, step 306 considers whether the current frame was classified as
active or inactive in step 304. If active, control flow proceeds to step 308. If inactive,
control flow proceeds to step 310.
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Tho« frames wUch are c.,.^ed a. acivc a„ cteified in step 308 a« 

^, vcccd. unvoiced, or .ra.^.„, f,a.es. The. =idUed i„ «,e an wi„ recognize .ha, 
hu„.an speech can be classified in n^any difieren. Two conventiona, classificattons 
ofspeech are voiced and unvoiced sounds. According ,o *e p„senUnven,io.. ^, speech 
5 whch ,s no. vcced or unvoiced is classified as .ransien. speech. 

"A dep,cs an «a„,p,. portion of.W including voiced speech 402 Voiced 
soun s are produced by forcing air trough tt.e glo.Us wiu, *e .ens.o„ of .he voca, cords 
adjus.ed so .ha. d,ey vibrare in a relaxed osci,la.io„. Urereby producng ,„asi.peri„dic 

pulses of air which excite the vocal frs^r^f n«« ^« 

. ^ ^"^'^^"^on property measured in voiced speech 

lu IS the pitch period, as shovm in FIG. 4A. 

FIG. 4B depicts an example portion of s(n) including unvoiced speech 404.
Unvoiced sounds are generated by forming a constriction at some point in the vocal tract
(usually toward the mouth end), and forcing air through the constriction at a high enough
velocity to produce turbulence. The resulting unvoiced speech signal resembles colored
noise.

FIG. 4C depicts an example portion of s(n) including transient speech 406 (i.e.,
speech which is neither voiced nor unvoiced). The example transient speech 406 shown
in FIG. 4C might represent s(n) transitioning between unvoiced speech and voiced speech.
Skilled artisans will recognize that many different classifications of speech could be
employed according to the techniques described herein to achieve comparable results.

In step 310, an encoder/decoder mode is selected based on the frame classification
made in steps 306 and 308. The various encoder/decoder modes are connected in parallel,
as shown in FIG. 2. One or more of these modes can be operational at any given time.
However, as described in detail below, only one mode preferably operates at any given
time, and is selected according to the classification of the current frame.

Several encoder/decoder modes are described in the following sections. The
different encoder/decoder modes operate according to different coding schemes. Certain
modes are more effective at coding portions of the speech signal s(n) exhibiting certain
properties.

In a preferred embodiment, a "Code Excited Linear Predictive" (CELP) mode is
chosen to code frames classified as transient speech. The CELP mode excites a linear
predictive vocal tract model with a quantized version of the linear prediction residual
signal. Of all the encoder/decoder modes described herein, CELP generally produces the
most accurate speech reproduction but requires the highest bit rate. In one embodiment, the
CELP mode performs encoding at 8500 bits per second.

A "Prototype Pitch Period" (PPP) mode is preferably chosen to code frames
classified as voiced speech. Voiced speech contains slowly time varying periodic
components which are exploited by the PPP mode. The PPP mode codes only a subset of
the pitch periods within each frame. The remaining periods of the speech signal are
reconstructed by interpolating between these prototype periods. By exploiting the
periodicity of voiced speech, PPP is able to achieve a lower bit rate than CELP and still
reproduce the speech signal in a perceptually accurate manner. In one embodiment, the
PPP mode performs encoding at 3900 bits per second.

A "Noise Excited Linear Predictive" (NELP) mode is chosen to code frames
classified as unvoiced speech. NELP uses a filtered pseudo-random noise signal to model
unvoiced speech. NELP uses the simplest model for the coded speech, and therefore
achieves the lowest bit rate. In one embodiment, the NELP mode performs encoding at
1500 bits per second.
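
The effect of per-frame mode switching on the average bit rate can be illustrated with a short sketch. It uses only the three bit rates quoted above for one embodiment; the dictionary layout and function names are hypothetical, and inactive frames are deliberately left out because no rate for them is stated in this passage.

```python
# Bit rates (bits per second) quoted in the text for one embodiment.
MODE_FOR_CLASS = {
    "transient": ("CELP", 8500),
    "voiced": ("PPP", 3900),
    "unvoiced": ("NELP", 1500),
}
FRAME_SECONDS = 0.020  # preferred 20 ms frame


def select_mode(frame_class):
    """Return (mode name, bit rate) for an active-speech frame class."""
    return MODE_FOR_CLASS[frame_class]


def average_bit_rate(frame_classes):
    """Average bit rate over a sequence of classified active frames."""
    total_bits = sum(select_mode(c)[1] * FRAME_SECONDS for c in frame_classes)
    return total_bits / (len(frame_classes) * FRAME_SECONDS)
```

For example, a run of one transient, two voiced, and one unvoiced frame averages well below the CELP rate, which is the motivation for switching modes frame by frame.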

The same coding technique can frequently be operated at different bit rates, with
varying levels of performance. The different encoder/decoder modes in FIG. 2 can
therefore represent different coding techniques, or the same coding technique operating at
different bit rates, or combinations of the above. Skilled artisans will recognize that
increasing the number of encoder/decoder modes will allow greater flexibility when
choosing a mode, which can result in a lower average bit rate, but will increase complexity
within the overall system. The particular combination used in any given system will be
dictated by the available system resources and the specific signal environment.

In step 312, the selected encoder mode 204 encodes the current frame and preferably
packs the encoded data into data packets for transmission. And in step 314, the
corresponding decoder mode 206 unpacks the data packets, decodes the received data and
reconstructs the speech signal. These operations are described in detail below with respect
to the appropriate encoder/decoder modes.

III. Initial Parameter Determination

FIG. 5 is a flowchart describing step 302 in greater detail. Various initial
parameters are calculated according to the present invention. The parameters preferably
include, e.g., LPC coefficients, line spectrum information (LSI) coefficients, normalized
autocorrelation functions (NACFs), open loop lag, band energies, zero crossing rate, and
the formant residual signal. These parameters are used in various ways within the overall
system, as described below.

In a preferred embodiment, initial parameter calculation module 202 uses a "look
ahead" of 160 + 40 samples. This serves several purposes. First, the 160 sample look
ahead allows a pitch frequency track to be computed using information in the next frame,
which significantly improves the robustness of the voice coding, and the pitch period
estimation techniques, described below. Second, the 160 sample look ahead also allows
the LPC coefficients, the frame energy, and the voice activity to be computed for one frame
in the future. This allows for efficient, multi-frame quantization of the frame energy and
LPC coefficients. Third, the additional 40 sample look ahead is for calculation of the LPC
coefficients on Hamming windowed speech as described below. Thus the number of
samples buffered before processing the current frame is 160 + 160 + 40, which includes the
current frame and the 160 + 40 sample look ahead.

A. Calculation of LPC Coefficients 

The present invention utilizes an LPC prediction error filter to remove the short
term redundancies in the speech signal. The transfer function for the LPC filter is:

$$ A(z) = 1 - \sum_{i=1}^{10} a_i z^{-i} $$

The present invention preferably implements a tenth-order filter, as shown in the previous
equation. An LPC synthesis filter in the decoder reinserts the redundancies, and is given
by the inverse of A(z):

$$ \frac{1}{A(z)} = \frac{1}{1 - \sum_{i=1}^{10} a_i z^{-i}} $$

In step 502, the LPC coefficients, a_i, are computed from s(n) as follows. The LPC
parameters are preferably computed for the next frame during the encoding procedure for
the current frame.

A Hamming window is applied to the current frame centered between the 119th and
120th samples (assuming the preferred 160 sample frame with a "look ahead"). The
windowed speech signal, s_w(n), is given by:

$$ s_w(n) = s(n+40)\left(0.5 + 0.46\cos\left(\pi\,\frac{n - 79.5}{80}\right)\right), \quad 0 \le n < 160 $$

The offset of 40 samples results in the window of speech being centered between the 119th
and 120th sample of the preferred 160 sample frame of speech.

Eleven autocorrelation values are preferably computed as

$$ R(k) = \sum_{m=0}^{159-k} s_w(m)\, s_w(m+k), \quad 0 \le k \le 10 $$

The autocorrelation values are windowed to reduce the probability of missing roots of line
spectral pairs (LSPs) obtained from the LPC coefficients, as given by:

$$ R(k) = h(k)\,R(k), \quad 0 \le k \le 10 $$

resulting in a slight bandwidth expansion, e.g., 25 Hz. The values h(k) are preferably taken
from the center of a 255 point Hamming window.

The LPC coefficients are then obtained from the windowed autocorrelation values
using Durbin's recursion. Durbin's recursion, a well known efficient computational
method, is discussed in the text Digital Processing of Speech Signals by Rabiner & Schafer.
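
As an illustration of how autocorrelation values lead to LPC coefficients, the following is a minimal pure-Python sketch of the textbook Levinson-Durbin recursion. It assumes the sign convention A(z) = 1 - sum_i a_i z^-i used in this section; the Hamming and lag windowing steps are omitted, and the function names are hypothetical.

```python
def autocorrelation(x, max_lag):
    """R(k) = sum_m x[m] * x[m + k] for k = 0..max_lag."""
    n = len(x)
    return [sum(x[m] * x[m + k] for m in range(n - k))
            for k in range(max_lag + 1)]


def durbin(r, order):
    """Solve for predictor coefficients a_1..a_order from autocorrelations.

    Uses the convention A(z) = 1 - sum_i a_i z^-i, so the prediction of
    s(n) is sum_i a_i * s(n - i). Returns (a, prediction_error).
    """
    a = [0.0] * (order + 1)   # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err
```

In the preferred embodiment the call would be `durbin(R, 10)` on the eleven windowed autocorrelation values; the sketch does not guard against a zero-energy frame (`r[0] == 0`).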

B. LSI Calculation

In step 504, the LPC coefficients are transformed into line spectrum information
(LSI) coefficients for quantization and interpolation. The LSI coefficients are computed
according to the present invention in the following manner.

As before, A(z) is given by

$$ A(z) = 1 - a_1 z^{-1} - \ldots - a_{10} z^{-10} $$

where a_i are the LPC coefficients, and 1 ≤ i ≤ 10.



WO 00/38179 PCT/US99/30587
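P_A(z) and Q_A(z) are defined as the following

$$ P_A(z) = A(z) + z^{-11} A(z^{-1}) = p_0 + p_1 z^{-1} + \ldots + p_{11} z^{-11} $$
$$ Q_A(z) = A(z) - z^{-11} A(z^{-1}) = q_0 + q_1 z^{-1} + \ldots + q_{11} z^{-11} $$

where

$$ p_i = -a_i - a_{11-i}, \quad 1 \le i \le 10 $$
$$ q_i = -a_i + a_{11-i}, \quad 1 \le i \le 10 $$

and

$$ p_0 = 1, \quad p_{11} = 1 $$
$$ q_0 = 1, \quad q_{11} = -1 $$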

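The line spectral cosines (LSCs) are the ten roots in -1.0 < x < 1.0 of the following
two functions:

$$ P'(x) = p'_0 \cos(5\cos^{-1}(x)) + p'_1 \cos(4\cos^{-1}(x)) + \ldots + p'_4 x + \frac{p'_5}{2} $$
$$ Q'(x) = q'_0 \cos(5\cos^{-1}(x)) + q'_1 \cos(4\cos^{-1}(x)) + \ldots + q'_4 x + \frac{q'_5}{2} $$

where (with p'_0 = p_0 and q'_0 = q_0)

$$ p'_i = p_i - p'_{i-1}, \quad 1 \le i \le 5 $$
$$ q'_i = q_i + q'_{i-1}, \quad 1 \le i \le 5 $$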

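
The LSI coefficients are then calculated as:

$$ lsi_i = \begin{cases} 0.5\sqrt{1 - lsc_i}, & lsc_i \ge 0 \\ 1.0 - 0.5\sqrt{1 + lsc_i}, & lsc_i < 0 \end{cases} $$

The LSCs can be obtained back from the LSI coefficients according to:

$$ lsc_i = \begin{cases} 1.0 - 4\,lsi_i^2, & lsi_i \le 0.5 \\ 4\,(1 - lsi_i)^2 - 1.0, & lsi_i > 0.5 \end{cases} $$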
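
The LSC/LSI mapping can be checked numerically with a small sketch. It assumes the forward mapping lsi = 0.5·sqrt(1 - lsc) for lsc ≥ 0 and lsi = 1 - 0.5·sqrt(1 + lsc) for lsc < 0, and takes the second branch of the inverse to be the algebraic inverse of that forward branch; both the function names and that reading of the inverse are assumptions of this sketch.

```python
import math


def lsc_to_lsi(lsc):
    """Map a line spectral cosine in [-1, 1] to an LSI value in [0, 1]."""
    if lsc >= 0.0:
        return 0.5 * math.sqrt(1.0 - lsc)
    return 1.0 - 0.5 * math.sqrt(1.0 + lsc)


def lsi_to_lsc(lsi):
    """Inverse mapping, obtained by solving each forward branch for lsc."""
    if lsi <= 0.5:
        return 1.0 - 4.0 * lsi * lsi
    return 4.0 * (1.0 - lsi) ** 2 - 1.0
```

Under these assumptions the mapping is continuous at lsc = 0 (lsi = 0.5) and sends lsc = 1 to 0 and lsc = -1 to 1, so a round trip through both functions returns the starting value.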




The stability of the LPC filter guarantees that the roots of the two functions
alternate, i.e., the smallest root, lsc_1, is the smallest root of P'(x), the next smallest root,
lsc_2, is the smallest root of Q'(x), etc. Thus, lsc_1, lsc_3, lsc_5, lsc_7, and lsc_9 are the
roots of P'(x), and lsc_2, lsc_4, lsc_6, lsc_8, and lsc_10 are the roots of Q'(x).

Those skilled in the art will recognize that it is preferable to employ some method
for computing the sensitivity of the LSI coefficients to quantization. "Sensitivity
weightings" can be used in the quantization process to appropriately weight the quantization
error in each LSI.

The LSI coefficients are quantized using a multistage vector quantizer (VQ). The
number of stages preferably depends on the particular bit rate and codebooks employed.
The codebooks are chosen based on whether or not the current frame is voiced.

The vector quantization minimizes a weighted-mean-squared error (WMSE) which
is defined as

$$ E(x, y) = \sum_{i=0}^{P-1} w_i (x_i - y_i)^2 $$

where x is the vector to be quantized, w the weight associated with it, and y is the
codevector. In a preferred embodiment, w are sensitivity weightings and P = 10.
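
A single stage of such a search can be sketched as an exhaustive minimization of the weighted squared error over a codebook. The codebook contents, the function names, and the exhaustive search strategy are illustrative assumptions; the description does not specify how the search is organized.

```python
def wmse(x, y, w):
    """Weighted squared error sum_i w_i * (x_i - y_i)^2."""
    return sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w))


def nearest_codevector(x, codebook, w):
    """Return (index, error) of the codevector minimizing the WMSE."""
    errors = [wmse(x, y, w) for y in codebook]
    best = min(range(len(codebook)), key=errors.__getitem__)
    return best, errors[best]
```

In a multistage quantizer, each later stage would typically quantize the residual left by the earlier stages, using the same error measure.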

The LSI vector is reconstructed from the LSI codes obtained by way of quantization
as

$$ qlsi = \sum_{i} CBi_{code_i} $$

where CBi is the i-th stage VQ codebook for either voiced or unvoiced frames (this is
based on the code indicating the choice of the codebook) and code_i is the LSI code for
the i-th stage.

Before the LSI coefficients are transformed to LPC coefficients, a stability check 
is performed to ensure that the resulting LPC filters have not been made unstable due to 
quantization noise or channel errors injecting noise into the LSI coefficients. Stability is 
guaranteed if the LSI coefficients remain ordered. 

In calculating the original LPC coefficients, a speech window centered between the
119th and 120th sample of the frame was used. The LPC coefficients for other points in the
frame are approximated by interpolating between the previous frame's LSCs and the current
frame's LSCs. The resulting interpolated LSCs are then converted back into LPC
coefficients. The exact interpolation used for each subframe is given by:
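
$$ ilsc_j = (1 - \alpha_i)\, lscprev_j + \alpha_i\, lsccurr_j, \quad 1 \le j \le 10 $$

where α_i are the interpolation factors 0.375, 0.625, 0.875, 1.000 for the four subframes of
40 samples each and ilsc are the interpolated LSCs. \hat{P}_A(z) and \hat{Q}_A(z) are computed by
the interpolated LSCs as

$$ \hat{P}_A(z) = (1 + z^{-1}) \prod_{j=1}^{5} \left(1 - 2\,ilsc_{2j-1}\, z^{-1} + z^{-2}\right) $$
$$ \hat{Q}_A(z) = (1 - z^{-1}) \prod_{j=1}^{5} \left(1 - 2\,ilsc_{2j}\, z^{-1} + z^{-2}\right) $$

The interpolated LPC coefficients for all four subframes are computed as coefficients of

$$ \hat{A}(z) = \frac{\hat{P}_A(z) + \hat{Q}_A(z)}{2} $$

Thus,

$$ \hat{a}_i = \begin{cases} -\dfrac{\hat{p}_i + \hat{q}_i}{2}, & 1 \le i \le 5 \\[2ex] -\dfrac{\hat{p}_{11-i} - \hat{q}_{11-i}}{2}, & 6 \le i \le 10 \end{cases} $$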

C. NACF Calculation 



In step 506, the normalized autocorrelation functions (NACFs) are calculated
according to the current invention.

The formant residual for the next frame is computed over four 40 sample subframes
as

$$ r(n) = s(n) - \sum_{i=1}^{10} a_i\, s(n - i) $$

where a_i is the i-th interpolated LPC coefficient of the corresponding subframe, where the
interpolation is done between the current frame's unquantized LSCs and the next frame's
LSCs. The next frame's energy is also computed as

$$ E_N = 0.5 \log_2\left( \frac{\sum_{n=0}^{159} r^2(n)}{160} \right) $$

The residual calculated above is low pass filtered and decimated, preferably using
a zero phase FIR filter of length 15, the coefficients of which, df_i, -7 ≤ i ≤ 7, are {0.0800,
0.1256, 0.2532, 0.4376, 0.6424, 0.8268, 0.9544, 1.000, 0.9544, 0.8268, 0.6424, 0.4376,
0.2532, 0.1256, 0.0800}. The low pass filtered, decimated residual is computed as

$$ r_d(n) = \sum_{i=-7}^{7} df_i\, r(Fn + i), \quad 0 \le n < 160/F $$

where F = 2 is the decimation factor, and r(Fn + i), -7 ≤ Fn + i ≤ 6, are obtained from the
last 14 values of the current frame's residual based on unquantized LPC coefficients. As
mentioned above, these LPC coefficients are computed and stored during the previous
frame.
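
The filtering and decimation step can be sketched directly from the sum above, using the fifteen listed coefficients and F = 2. How samples outside the frame are supplied is an assumption here: the sketch takes an optional `history` of the seven preceding residual samples and treats anything else out of range as zero.

```python
DF = [0.0800, 0.1256, 0.2532, 0.4376, 0.6424, 0.8268, 0.9544, 1.000,
      0.9544, 0.8268, 0.6424, 0.4376, 0.2532, 0.1256, 0.0800]  # df_-7..df_7
F = 2  # decimation factor


def lowpass_decimate(residual, history=None):
    """r_d(n) = sum_{i=-7}^{7} df_i * r(F*n + i), 0 <= n < len(residual)/F.

    `history` supplies r(-7..-1) from the previous frame (zeros if omitted);
    samples past the end of `residual` are treated as zero. Both choices
    are assumptions of this sketch.
    """
    history = list(history) if history is not None else [0.0] * 7

    def r(idx):
        if idx < 0:
            return history[idx + 7] if idx >= -7 else 0.0
        return residual[idx] if idx < len(residual) else 0.0

    return [sum(DF[i + 7] * r(F * n + i) for i in range(-7, 8))
            for n in range(len(residual) // F)]
```

A 160 sample residual yields 80 decimated samples, so each 40 sample subframe of the decimated signal spans half a frame of the original.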

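The NACFs for two subframes (40 samples decimated) of the next frame are
calculated as follows:

$$ Exx_k = \sum_{i=0}^{39} r_d(40k + i)\, r_d(40k + i), \quad k = 0, 1 $$
$$ Exy_{k,j} = \sum_{i=0}^{39} r_d(40k + i)\, r_d(40k + i - j), \quad 12/2 \le j < 128/2,\ k = 0, 1 $$
$$ Eyy_{k,j} = \sum_{i=0}^{39} r_d(40k + i - j)\, r_d(40k + i - j), \quad 12/2 \le j < 128/2,\ k = 0, 1 $$
$$ n\_corr_{k,\,j-12/2} = \frac{(Exy_{k,j})^2}{Exx_k\, Eyy_{k,j}}, \quad 12/2 \le j < 128/2,\ k = 0, 1 $$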

For r_d(n) with negative n, the current frame's low-pass filtered and decimated
residual (stored during the previous frame) is used. The NACFs for the current subframe,
c_corr, were also computed and stored during the previous frame.


D. Pitch Track and Lag Calculation 

In step 508, the pitch track and pitch lag are computed according to the present
invention. The pitch lag is preferably calculated using a Viterbi-like search with a
backward track as follows.

$$ R1_i = n\_corr_{0,i} + \max_j \{ n\_corr_{1,\,j + FAN_{i,0}} \}, \quad 0 \le i < 116/2,\ 0 \le j < FAN_{i,1} $$
$$ R2_i = c\_corr_{1,i} + \max_j \{ R1_{j + FAN_{i,0}} \}, \quad 0 \le i < 116/2,\ 0 \le j < FAN_{i,1} $$
$$ RM_{2i} = R2_i + \max_j \{ c\_corr_{0,\,j + FAN_{i,0}} \}, \quad 0 \le i < 116/2,\ 0 \le j < FAN_{i,1} $$

where FAN_{i,j} is the 2 x 58 matrix, {{0,2}, {0,3}, {2,2}, {2,3}, {2,4}, {3,4}, {4,4}, {5,4},
{5,5}, {6,5}, {7,5}, {8,6}, {9,6}, {10,6}, {11,6}, {11,7}, {12,7}, {13,7}, {14,8}, {15,8},
{16,8}, {16,9}, {17,9}, {18,9}, {19,9}, {20,10}, {21,10}, {22,10}, {22,11}, {23,11},
{24,11}, {25,12}, {26,12}, {27,12}, {28,12}, {28,13}, {29,13}, {30,13}, {31,14}, {32,14},
{33,14}, {33,15}, {34,15}, {35,15}, {36,15}, {37,16}, {38,16}, {39,16}, {39,17}, {40,17},
{41,16}, {42,16}, {43,15}, {44,14}, {45,13}, {45,13}, {46,12}, {47,11}}. The vector
RM_{2i} is interpolated to get values for RM_{2i+1} as

$$ RM_{iF+1} = \sum_{j=0}^{3} cf_j\, RM_{(i-1+j)F}, \quad 1 \le i < 112/2 $$
$$ RM_1 = (RM_0 + RM_2)/2 $$
$$ RM_{2 \cdot 56 + 1} = (RM_{2 \cdot 56} + RM_{2 \cdot 57})/2 $$
$$ RM_{2 \cdot 57 + 1} = RM_{2 \cdot 57} $$

where cf_j is the interpolation filter whose coefficients are {-0.0625, 0.5625, 0.5625,
-0.0625}. The lag L_C is then chosen such that RM_{L_C - 12} = max{RM_i}, 4 ≤ i < 116, and the
current frame's NACF is set equal to RM_{L_C - 12}/4. Lag multiples are then removed by
searching for the lag corresponding to the maximum correlation greater than 0.9 RM_{L_C - 12}
amidst:

$$ RM_{\max\{\lfloor L_C/M \rfloor - 14,\ 16\}} \ \ldots\ RM_{\lfloor L_C/M \rfloor - 10} \quad \text{for all } 1 \le M \le \lfloor L_C/16 \rfloor $$



E. Calculation of Band Energy and Zero Crossing Rate 

In step 510, energies in the 0-2 kHz band and 2 kHz-4 kHz band are computed
according to the present invention as

$$E_L = \sum_{i=0}^{159} s_L^2(i)$$

$$E_H = \sum_{i=0}^{159} s_H^2(i)$$

where

$$S_L(z) = S(z)\, \frac{bl_0 + \sum_{i=1}^{15} bl_i z^{-i}}{al_0 + \sum_{i=1}^{15} al_i z^{-i}}, \qquad S_H(z) = S(z)\, \frac{bh_0 + \sum_{i=1}^{15} bh_i z^{-i}}{ah_0 + \sum_{i=1}^{15} ah_i z^{-i}}$$

$S(z)$, $S_L(z)$ and $S_H(z)$ being the z-transforms of the input speech signal $s(n)$, low-pass signal
$s_L(n)$ and high-pass signal $s_H(n)$ respectively,

bl = {0.0003, 0.0048, 0.0333, 0.1443, 0.4329, 0.9524, 1.5873, 2.0409, 2.0409, 1.5873,
0.9524, 0.4329, 0.1443, 0.0333, 0.0048, 0.0003},

al = {1.0, 0.9155, 2.4074, 1.6511, 2.0597, 1.0584, 0.7976, 0.3020, 0.1465, 0.0394, 0.0122,
0.0021, 0.0004, 0.0, 0.0, 0.0},

bh = {0.0013, -0.0189, 0.1324, -0.5737, 1.7212, -3.7867, 6.3112, -8.1144, 8.1144, -6.3112,
3.7867, -1.7212, 0.5737, -0.1324, 0.0189, -0.0013} and

ah = {1.0, -2.8818, 5.7550, -7.7730, 8.2419, -6.8372, 4.6171, -2.5257, 1.1296, -0.4084,
0.1183, -0.0268, 0.0046, -0.0006, 0.0, 0.0}.
The speech signal energy itself is $E = \sum_{i=0}^{159} s^2(i)$. The zero crossing rate ZCR is
computed as

    if (s(n) s(n+1) < 0) ZCR = ZCR + 1,   0 <= n < 159
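
The frame energy and zero crossing count described above can be sketched in a few lines. The function names are illustrative only, and the frame is assumed to be a plain sequence of 160 samples.

```python
def frame_energy(frame):
    """Sum of squared samples over the frame."""
    return sum(s * s for s in frame)


def zero_crossing_rate(frame):
    """Count adjacent sample pairs whose product is negative.

    For a 160-sample frame this visits n = 0..158, matching the range
    given for the ZCR update. A sample equal to zero never counts as a
    crossing because the product is then zero, not negative.
    """
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
```

A noise-like (unvoiced) frame typically yields a much larger count than a voiced frame, which is how ZCR is used in the classification thresholds later in the text.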

F. Calculation of the Formant Residual 

In step 512, the formant residual for the current frame is computed over four
subframes as

$$r_{curr}(n) = s(n) - \sum_{i=1}^{10} \hat{a}_i\, s(n-i)$$

where $\hat{a}_i$ is the $i$-th LPC coefficient of the corresponding subframe.

IV. Active/Inactive Speech Classification

Referring back to FIG. 3, in step 304, the current frame is classified as either active
speech (e.g., spoken words) or inactive speech (e.g., background noise, silence). FIG. 6 is
a flowchart 600 that depicts step 304 in greater detail. In a preferred embodiment, a two
energy band based thresholding scheme is used to determine if active speech is present.
The lower band (band 0) spans frequencies from 0.1-2.0 kHz and the upper band (band 1)
from 2.0-4.0 kHz. Voice activity detection is preferably determined for the next frame
during the encoding procedure for the current frame, in the following manner.

In step 602, the band energies $E_b(i)$ for bands $i = 0, 1$ are computed. The
autocorrelation sequence, as described above in Section III.A., is extended to 19 using the
following recursive equation:

$$R(k) = \sum_{i=1}^{10} a_i\, R(k-i), \quad 11 \le k \le 19$$

Using this equation, R(11) is computed from R(1) to R(10), R(12) is computed from R(2)
to R(11), and so on. The band energies are then computed from the extended
autocorrelation sequence using the following equation:

$$E_b(i) = \log_2 \left[ R(0)\, R_{h(i)}(0) + 2 \sum_{k=1}^{19} R(k)\, R_{h(i)}(k) \right], \quad i = 0,1$$

where $R(k)$ is the extended autocorrelation sequence for the current frame and $R_{h(i)}(k)$ is the
band filter autocorrelation sequence for band $i$ given in Table 1.
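
The recursive extension of the autocorrelation sequence from lag 10 to lag 19 can be sketched as below. The function name and argument layout are assumptions made for illustration; `a` holds the ten LPC coefficients a_1..a_10 and `r` holds R(0)..R(10).

```python
def extend_autocorrelation(r, a, last_lag=19):
    """Extend R(0..len(r)-1) up to R(last_lag) with R(k) = sum_i a_i * R(k - i).

    r : list of known autocorrelation values, R(0) first
    a : list of LPC coefficients, a[0] is a_1
    """
    order = len(a)
    ext = list(r)
    for k in range(len(r), last_lag + 1):
        # Each new lag only needs the previous `order` values,
        # e.g. R(11) uses R(1)..R(10), R(12) uses R(2)..R(11).
        ext.append(sum(a[i - 1] * ext[k - i] for i in range(1, order + 1)))
    return ext
```

The extended sequence is then what the band energy computation consumes together with the per-band filter autocorrelations of Table 1.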

Table 1: Filter Autocorrelation Sequences for Band Energy Calculations 



 k     R_h(0)(k), band 0     R_h(1)(k), band 1
 0      4.230889E-01          4.042770E-01
 1      2.693014E-01         -2.503076E-01
 2     -1.124000E-02         -3.059308E-02
 3     -1.301279E-01          1.497124E-01
 4     -j.y49044E-02         -7.905954E-02
 5      1.494007E-02          4.371288E-03
 6     -2.087666E-03         -2.088545E-02
 7     -3.823536E-02          5.622753E-02
 8     -2.748034E-02         -4.420598E-02
 9      3.015699E-04          1.443167E-02
 10     3.722060E-03         -8.462525E-03
 11    -6.416949E-03          1.627144E-02
 12    -6.551736E-03         -1.476080E-02
 13     5.493820E-04          6.187041E-03
 14     2.934550E-03         -1.898632E-03
 15     8.041829E-04          2.053577E-03
 16    -2.857628E-04         -1.860064E-03
 17     2.585250E-04          7.729618E-04
 18     4.816371E-04         -2.297862E-04
 19     1.692738E-04          2.107964E-04



In step 604, the band energy estimates are smoothed. The smoothed band energy
estimates, $E_{sm}(i)$, are updated for each frame using the following equation:

$$E_{sm}(i) = 0.6\,E_{sm}(i) + 0.4\,E_b(i), \quad i = 0,1$$

In step 606, signal energy and noise energy estimates are updated. The signal
energy estimates, $E_s(i)$, are preferably updated using the following equation:

$$E_s(i) = \max(E_{sm}(i),\ E_s(i)), \quad i = 0,1$$

The noise energy estimates, $E_n(i)$, are preferably updated using the following
equation:

$$E_n(i) = \min(E_{sm}(i),\ E_n(i)), \quad i = 0,1$$

In step 608, the long term signal-to-noise ratios for the two bands, $SNR(i)$, are
computed as

$$SNR(i) = E_s(i) - E_n(i), \quad i = 0,1$$

In step 610, these SNR values are preferably divided into eight regions $Reg_{SNR}(i)$
defined as

$$Reg_{SNR}(i) = \begin{cases} 0, & 0.6\,SNR(i) - 4 < 0 \\ \mathrm{round}(0.6\,SNR(i) - 4), & 0 \le 0.6\,SNR(i) - 4 < 7 \\ 7, & 0.6\,SNR(i) - 4 \ge 7 \end{cases}$$

In step 612, the voice activity decision is made in the following manner according
to the current invention. If either $E_b(0) - E_n(0) > THRESH(Reg_{SNR}(0))$, or $E_b(1) - E_n(1) >
THRESH(Reg_{SNR}(1))$, then the frame of speech is declared active. Otherwise, the frame of
speech is declared inactive. The values of THRESH are defined in Table 2.



Table 2: Threshold Factors as a Function of the SNR Region

 SNR Region    THRESH
 0             2.807
 1             2.807
 2             3.000
 3             3.104
 4             3.154
 5             3.233
 6             3.459
 7             3.982
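
The region mapping of step 610 and the threshold test of step 612 can be sketched as follows. The function names are illustrative, rounding is assumed to be round-half-up, and the comparison of band energy against noise energy follows the reading of the damaged source text, so the details should be treated as assumptions.

```python
import math

# Threshold factors indexed by SNR region (Table 2).
THRESH = [2.807, 2.807, 3.000, 3.104, 3.154, 3.233, 3.459, 3.982]


def snr_region(snr):
    """Map a long-term band SNR to one of eight regions, 0..7."""
    v = 0.6 * snr - 4
    if v < 0:
        return 0
    if v >= 7:
        return 7
    return int(math.floor(v + 0.5))  # assumed rounding rule


def is_active(band_energy, noise_energy, snr):
    """Declare the frame active if either band exceeds its threshold.

    All three arguments are two-element sequences: index 0 is the lower
    band, index 1 is the upper band.
    """
    return any(
        band_energy[i] - noise_energy[i] > THRESH[snr_region(snr[i])]
        for i in (0, 1)
    )
```

Because the thresholds grow with the SNR region, a clean signal needs a larger margin over the noise floor before a frame is declared active than a noisy one does.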



The signal energy estimates, $E_s(i)$, are preferably updated using the following
equation:

$$E_s(i) = E_s(i) - 0.014499, \quad i = 0,1.$$



WO 00/38179    PCT/US99/30587



The noise energy estimates, $E_n(i)$, are preferably updated using the following
equation:

$$E_n(i) = \begin{cases} 4, & E_n(i) + 0.0066 < 4 \\ 23, & 23 < E_n(i) + 0.0066 \\ E_n(i) + 0.0066, & \text{otherwise} \end{cases} \quad i = 0,1$$



Hangover Frames 



When signal-to-noise ratios are low, "hangover" frames are preferably added to
improve the quality of the reconstructed speech. If the three previous frames were
classified as active, and the current frame is classified inactive, then the next M frames
including the current frame are classified as active speech. The number of hangover
frames, M, is preferably determined as a function of SNR(0) as defined in Table 3.



Table 3: Hangover Frames as a Function of SNR(0)

 SNR(0)    M
 0         4
 1         3
 2         3
 3         3
 4         3
 5         3
 6         3
 7         ->
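
The hangover rule can be sketched as a small post-processing pass over the raw per-frame decisions. This is an illustration under stated assumptions: the function name is hypothetical, the "three previous frames" are taken to be the raw (pre-hangover) decisions, and the hangover length M is passed in by the caller rather than looked up in Table 3.

```python
def apply_hangover(raw_active, hangover_frames):
    """Extend active regions by up to `hangover_frames` frames.

    raw_active      : list of booleans, one raw VAD decision per frame
    hangover_frames : M, the number of frames (including the current one)
                      to force active after three active frames in a row
    """
    out = []
    remaining = 0
    for n, active in enumerate(raw_active):
        if active:
            remaining = 0
            out.append(True)
            continue
        # Start a hangover only on the first inactive frame that follows
        # three consecutive raw-active frames.
        if remaining == 0 and n >= 3 and all(raw_active[n - 3:n]):
            remaining = hangover_frames
        if remaining > 0:
            out.append(True)
            remaining -= 1
        else:
            out.append(False)
    return out
```

With M = 3, three active frames followed by five inactive ones produce six active frames and then two inactive ones.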



V. Classification of Active Speech Frames

Referring back to FIG. 3, in step 308, current frames which were classified as being
active in step 304 are further classified according to properties exhibited by the speech
signal s(n). In a preferred embodiment, active speech is classified as either voiced,
unvoiced, or transient. The degree of periodicity exhibited by the active speech signal
determines how it is classified. Voiced speech exhibits the highest degree of periodicity
(quasi-periodic in nature). Unvoiced speech exhibits little or no periodicity. Transient
speech exhibits degrees of periodicity between voiced and unvoiced.

However, the general framework described herein is not limited to the preferred
classification scheme and the specific encoder/decoder modes described below. Active
speech can be classified in alternative ways, and alternative encoder/decoder modes are
available for coding. Those skilled in the art will recognize that many combinations of
classifications and encoder/decoder modes are possible. Many such combinations can
result in a reduced average bit rate according to the general framework described herein,
i.e., classifying speech as inactive or active, further classifying active speech, and then
coding the speech signal using encoder/decoder modes particularly suited to the speech
falling within each classification.

Although the active speech classifications are based on degree of periodicity, the
classification decision is preferably not based on some direct measurement of periodicity.
Rather, the classification decision is based on various parameters calculated in step 302,
e.g., signal to noise ratios in the upper and lower bands and the NACFs. The preferred
classification may be described by the following pseudo-code:

    if not (previousNACF < 0.5 and currentNACF > 0.6)
        if (currentNACF < 0.75 and ZCR > 60) UNVOICED
        else if (previousNACF < 0.5 and currentNACF < 0.55
                 and ZCR > 50) UNVOICED
        else if (currentNACF < 0.4 and ZCR > 40) UNVOICED
    if (UNVOICED and currentSNR > 28dB
        and E > αE_prev) TRANSIENT
    if (previousNACF < 0.5 and currentNACF < 0.5
        and E < 5e4 + N_noise) UNVOICED
    if (VOICED and low-bandSNR > high-bandSNR
        and previousNACF < 0.8 and
        0.6 < currentNACF < 0.75) TRANSIENT

where

$$\alpha = \begin{cases} 1.0, & E > 5e5 + N_{noise} \\ 20.0, & E \le 5e5 + N_{noise} \end{cases}$$

and $N_{noise}$ is an estimate of the background noise. $E_{prev}$ is the previous frame's input energy.

The method described by this pseudo code can be refined according to the specific
environment in which it is implemented. Those skilled in the art will recognize that the
various thresholds given above are merely exemplary, and could require adjustment in
practice depending upon the implementation. The method may also be refined by adding
additional classification categories, such as dividing TRANSIENT into two categories: one
for signals transitioning from high to low energy, and the other for signals transitioning
from low to high energy.

Those skilled in the art will recognize that other methods are available for 
distinguishing voiced, unvoiced, and transient active speech. Similarly, skilled artisans will 
recognize that other classification schemes for active speech are also possible. 
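
One possible reading of the classification pseudo-code is sketched below. Several details are assumptions rather than facts from the text: frames are taken to start as VOICED, the transient test is taken to compare the current energy against a scaled previous-frame energy, and the numeric thresholds (28 dB, 0.75, the two values of the scale factor) are readings of a damaged original. The function and parameter names are illustrative.

```python
def classify_active_frame(prev_nacf, curr_nacf, zcr, curr_snr_db,
                          low_band_snr, high_band_snr,
                          energy, prev_energy, noise):
    """Return 'VOICED', 'UNVOICED' or 'TRANSIENT' for an active frame."""
    label = "VOICED"  # assumed default when no rule fires

    if not (prev_nacf < 0.5 and curr_nacf > 0.6):
        if curr_nacf < 0.75 and zcr > 60:
            label = "UNVOICED"
        elif prev_nacf < 0.5 and curr_nacf < 0.55 and zcr > 50:
            label = "UNVOICED"
        elif curr_nacf < 0.4 and zcr > 40:
            label = "UNVOICED"

    # Scale factor for the energy-jump test (assumed values).
    alpha = 1.0 if energy > 5e5 + noise else 20.0
    if label == "UNVOICED" and curr_snr_db > 28 and energy > alpha * prev_energy:
        label = "TRANSIENT"

    if prev_nacf < 0.5 and curr_nacf < 0.5 and energy < 5e4 + noise:
        label = "UNVOICED"

    if (label == "VOICED" and low_band_snr > high_band_snr
            and prev_nacf < 0.8 and 0.6 < curr_nacf < 0.75):
        label = "TRANSIENT"

    return label
```

The rules are applied in sequence, so a later rule can override an earlier label; that ordering is taken from the order of the pseudo-code lines.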

VI. Encoder/Decoder Mode Selection

In step 310, an encoder/decoder mode is selected based on the classification of the
current frame in steps 304 and 308. According to a preferred embodiment, modes are
selected as follows: inactive frames and active unvoiced frames are coded using a NELP
mode, active voiced frames are coded using a PPP mode, and active transient frames are
coded using a CELP mode. Each of these encoder/decoder modes is described in detail in
following sections.
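
The preferred mapping from classification to coding mode is small enough to write out directly. The function name and the string labels below are illustrative choices, not identifiers from the text.

```python
def select_mode(active, speech_class=None):
    """Pick a coding mode from the activity decision and the active-speech class.

    active       : True if the frame was classified as active speech
    speech_class : 'VOICED', 'UNVOICED' or 'TRANSIENT' (ignored if inactive)
    """
    if not active or speech_class == "UNVOICED":
        return "NELP"   # noise-like content
    if speech_class == "VOICED":
        return "PPP"    # strongly periodic content
    if speech_class == "TRANSIENT":
        return "CELP"   # highest fidelity, highest bit rate
    raise ValueError("unknown speech class: %r" % (speech_class,))
```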

In an alternative embodiment, inactive frames are coded using a zero rate mode.
Skilled artisans will recognize that many alternative zero rate modes are available which
require very low bit rates. The selection of a zero rate mode may be further refined by
considering past mode selections. For example, if the previous frame was classified as
active, this may preclude the selection of a zero rate mode for the current frame. Similarly,
if the next frame is active, a zero rate mode may be precluded for the current frame.
Another alternative is to preclude the selection of a zero rate mode for too many
consecutive frames (e.g., 9 consecutive frames). Those skilled in the art will recognize that
many other modifications might be made to the basic mode selection decision in order to
refine its operation in certain environments.

As described above, many other combinations of classifications and
encoder/decoder modes might be alternatively used within this same framework. The
following sections provide detailed descriptions of several encoder/decoder modes
according to the present invention. The CELP mode is described first, followed by the PPP
mode and the NELP mode.

VII. Code Excited Linear Prediction (CELP) Coding Mode

As described above, the CELP encoder/decoder mode is employed when the current
frame is classified as active transient speech. The CELP mode provides the most accurate
signal reproduction (as compared to the other modes described herein) but at the highest bit
rate.

FIG. 7 depicts a CELP encoder mode 204 and a CELP decoder mode 206 in further
detail. As shown in FIG. 7A, CELP encoder mode 204 includes a pitch encoding module
702, an encoding codebook 704, and a filter update module 706. CELP encoder mode 204
outputs an encoded speech signal, $s_{enc}(n)$, which preferably includes codebook parameters
and pitch filter parameters, for transmission to CELP decoder mode 206. As shown in FIG.
7B, CELP decoder mode 206 includes a decoding codebook module 708, a pitch filter 710,
and an LPC synthesis filter 712. CELP decoder mode 206 receives the encoded speech
signal and outputs synthesized speech signal $\hat{s}(n)$.



A. Pitch Encoding Module 



Pitch encoding module 702 receives the speech signal $s(n)$ and the quantized
residual from the previous frame, $p_c(n)$ (described below). Based on this input, pitch
encoding module 702 generates a target signal $x(n)$ and a set of pitch filter parameters. In
a preferred embodiment, these pitch filter parameters include an optimal pitch lag $L^*$ and
an optimal pitch gain $b^*$. These parameters are selected according to an
"analysis-by-synthesis" method in which the encoding process selects the pitch filter
parameters that minimize the weighted error between the input speech and the synthesized
speech using those parameters.

FIG. 8 depicts pitch encoding module 702 in greater detail. Pitch encoding module
702 includes a perceptual weighting filter 802, adders 804 and 816, weighted LPC synthesis
filters 806 and 808, a delay and gain 810, and a minimize sum of squares 812.

Perceptual weighting filter 802 is used to weight the error between the original
speech and the synthesized speech in a perceptually meaningful way. The perceptual
weighting filter is of the form

$$W(z) = \frac{A(z)}{A(z/\gamma)}$$

where $A(z)$ is the LPC prediction error filter, and $\gamma$ preferably equals 0.8. Weighted LPC
analysis filter 806 receives the LPC coefficients calculated by initial parameter calculation
module 202. Filter 806 outputs $a_{zir}(n)$, which is the zero input response given the LPC
coefficients. Adder 804 sums a negative input $a_{zir}(n)$ and the filtered input signal to form
target signal $x(n)$.

Delay and gain 810 outputs an estimated pitch filter output $bp_L(n)$ for a given pitch
lag $L$ and pitch gain $b$. Delay and gain 810 receives the quantized residual samples from
the previous frame, $p_c(n)$, and an estimate of future output of the pitch filter, given by $p_o(n)$,
and forms $p(n)$ according to:

$$p(n) = \begin{cases} p_c(n), & -128 < n < 0 \\ p_o(n), & 0 \le n < L_p \end{cases}$$

which is then delayed by $L$ samples and scaled by $b$ to form $bp_L(n)$. $L_p$ is the subframe
length (preferably 40 samples). In a preferred embodiment, the pitch lag, $L$, is represented
by 8 bits and can take on values 20.0, 20.5, 21.0, 21.5, ..., 126.0, 126.5, 127.0, 127.5.

Weighted LPC analysis filter 808 filters $bp_L(n)$ using the current LPC coefficients
resulting in $by_L(n)$. Adder 816 sums a negative input $by_L(n)$ with $x(n)$, the output of which
is received by minimize sum of squares 812. Minimize sum of squares 812 selects the
optimal $L$, denoted by $L^*$, and the optimal $b$, denoted by $b^*$, as those values of $L$ and $b$ that
minimize $E_{pitch}(L)$ according to:

$$E_{pitch}(L) = \sum_{n=0}^{L_p-1} \{ x(n) - b\,y_L(n) \}^2$$

If $Exy(L) \triangleq \sum_{n=0}^{L_p-1} x(n)\,y_L(n)$ and $Eyy(L) \triangleq \sum_{n=0}^{L_p-1} y_L(n)^2$, then the value of $b$ which
minimizes $E_{pitch}(L)$ for a given value of $L$ is

$$b^* = \frac{Exy(L)}{Eyy(L)}$$

for which

$$E_{pitch}(L) = K - \frac{Exy(L)^2}{Eyy(L)}$$

where $K$ is a constant that can be neglected.

The optimal values of $L$ and $b$ ($L^*$ and $b^*$) are found by first determining the value
of $L$ which minimizes $E_{pitch}(L)$ and then computing $b^*$.
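
The two-step search (pick the lag first, then the gain) can be sketched as follows. Minimizing the squared error for a given lag is equivalent to maximizing Exy^2/Eyy, since the remaining term does not depend on the lag. The function name and the dictionary of pre-filtered candidates are illustrative assumptions; producing each candidate (delaying and filtering the past excitation) is outside this sketch.

```python
def search_pitch(x, candidates):
    """Return (best_lag, best_gain) for target x.

    x          : target signal for the subframe
    candidates : dict mapping a lag value to its filtered contribution y_L(n),
                 each the same length as x
    """
    best = None
    for lag, y in candidates.items():
        exy = sum(a * b for a, b in zip(x, y))
        eyy = sum(b * b for b in y)
        if eyy == 0:
            continue  # an all-zero candidate cannot reduce the error
        score = exy * exy / eyy  # error reduction achieved by this lag
        if best is None or score > best[0]:
            best = (score, lag, exy / eyy)
    if best is None:
        raise ValueError("no usable pitch candidate")
    return best[1], best[2]
```

For example, a candidate that is exactly half the target wins with a gain of 2.0, because scaling it reproduces the target with zero error.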
These pitch filter parameters are preferably calculated for each subframe and then
quantized for efficient transmission. In a preferred embodiment, the transmission codes
PLAGj and PGAINj for the $j$-th subframe are computed as

$$PGAINj = \left\lfloor \min\{b^*, 2\} \cdot \frac{8}{2} + 0.5 \right\rfloor - 1$$

$$PLAGj = \begin{cases} 0, & PGAINj < 0 \\ 2L^*, & 0 \le PGAINj < 8 \end{cases}$$

PGAINj is then adjusted to -1 if PLAGj is set to 0. These transmission codes are transmitted
to CELP decoder mode 206 as the pitch filter parameters, part of the encoded speech signal.

B. Encoding Codebook 

Encoding codebook 704 receives the target signal $x(n)$ and determines a set of
codebook excitation parameters which are used by CELP decoder mode 206, along with the
pitch filter parameters, to reconstruct the quantized residual signal.

Encoding codebook 704 first updates $x(n)$ as follows.

$$x(n) = x(n) - y_{pzir}(n), \quad 0 \le n < 40$$

where $y_{pzir}(n)$ is the output of the weighted LPC synthesis filter (with memories retained
from the end of the previous subframe) to an input which is the zero-input-response of the
pitch filter with parameters $\hat{L}^*$ and $\hat{b}^*$ (and memories resulting from the previous
subframe's processing).

A backfiltered target $\vec{d} = \{d_n\},\ 0 \le n < 40$ is created as $\vec{d} = H^T \vec{x}$ where

$$H = \begin{bmatrix} h_0 & 0 & 0 & \cdots & 0 \\ h_1 & h_0 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ h_{39} & h_{38} & h_{37} & \cdots & h_0 \end{bmatrix}$$

is the impulse response matrix formed from the impulse response $\{h_n\}$ and
$\vec{x} = \{x(n)\},\ 0 \le n < 40$. Two more vectors $\vec{\phi} = \{\phi_n\}$ and $\vec{s}$ are created as well.

$$\vec{s} = \mathrm{sign}(\vec{d})$$

$$\phi_n = \begin{cases} 2 \sum_{i=0}^{39-n} h_i\, h_{i+n}, & 0 < n < 40 \\ \sum_{i=0}^{39} h_i^2, & n = 0 \end{cases}$$

where

$$\mathrm{sign}(x) = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0 \end{cases}$$

Encoding codebook 704 initializes the values Exy* and Eyy* to zero and searches
for the optimum excitation parameters, preferably with four values of N (0, 1, 2, 3),
according to:

P =
A = 
B = 



{0.U,3,4})%5 
5,...,/' < 40} 
{a»P, + 5,...,^' <40} 



argmax 



A = {P2,P2 + ^,- ;i' < 40} 

B= {p„P-, + 5,..., k' < 40} 

Den,, = EyyO+ 2^, + ^.(^o^^,;,.,. + S,^^^__^ 
J ^ Ak E B



argmax 

fceB 



Den 



Exy\ = £rvO+|(^^J+|c/^J 
Eyy\ = Den, , 

{;'4,P4 + 5,...,/' <40) 



A 
Den, 



e A 



\ExyU\d\\ 
argmax^ — - — i— ^ 

t€A DeHf 



Eyyl = Den J 

If Exyi" Eyy* > Exy"" Eyy2 { 
^xy* = £xv2 

{/•/la',^ ind^^, ind^y^ = {/^ 4 4 4} 

{-y^;,^ ^g"^;, sgn^„ sg?i^„ sgji^,) = {^^ s,. Sj. S,} 

} 

Encoding codebook 704 calculates the codebook gain $G^*$ as $Exy^*/Eyy^*$, and then
quantizes the set of excitation parameters as the following transmission codes for the $j$-th
subframe:
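
$$CBIjk = \left\lfloor \frac{ind_k}{5} \right\rfloor, \quad 0 \le k < 5$$

$$SIGNjk = \begin{cases} 0, & sgn_k = 1 \\ 1, & sgn_k = -1 \end{cases} \quad 0 \le k < 5$$

$$CBGj = \left\lfloor \min\{\log_2(\max\{1, G^*\}),\ 11.2636\} \cdot \frac{31}{11.2636} + 0.5 \right\rfloor$$

and the quantized gain $\hat{G}^*$ is $2^{CBGj \cdot 11.2636/31}$.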

Lower bit rate embodiments of the CELP encoder/decoder mode may be realized
by removing pitch encoding module 702 and only performing a codebook search to
determine an index $I$ and gain $G$ for each of the four subframes. Those skilled in the art
will recognize how the ideas described above might be extended to accomplish this lower
bit rate embodiment.

C. CELP Decoder 

CELP decoder mode 206 receives the encoded speech signal, preferably including
codebook excitation parameters and pitch filter parameters, from CELP encoder mode 204,
and based on this data outputs synthesized speech $\hat{s}(n)$. Decoding codebook module 708
receives the codebook excitation parameters and generates the excitation signal $cb(n)$ with
a gain of $G$. The excitation signal $cb(n)$ for the $j$-th subframe contains mostly zeroes except
for the five locations:

$$I_k = 5\,CBIjk + k, \quad 0 \le k < 5$$

which correspondingly have impulses of value

$$S_k = 1 - 2\,SIGNjk, \quad 0 \le k < 5$$

all of which are scaled by the gain $G$, which is computed to be $2^{CBGj \cdot 11.2636/31}$, to provide
$G\,cb(n)$.
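
Building the sparse excitation for one subframe from the received codes can be sketched as below. The function name is illustrative, the gain is passed in already decoded (its exact decoding formula is not reproduced here), and each position code is assumed to lie in 0..7 so that all five pulses fall inside a 40-sample subframe.

```python
def build_excitation(cbi, sign, gain, subframe_len=40):
    """Return the scaled excitation G*cb(n) for one subframe.

    cbi  : five position codes; pulse k sits at index 5 * cbi[k] + k
    sign : five sign codes; 0 means +1 and 1 means -1
    gain : decoded codebook gain G
    """
    cb = [0.0] * subframe_len
    for k, (index, s) in enumerate(zip(cbi, sign)):
        position = 5 * index + k       # interleaved track for pulse k
        cb[position] = gain * (1 - 2 * s)
    return cb
```

Because pulse k can only occupy positions congruent to k modulo 5, the five pulses can never collide.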

Pitch filter 710 decodes the pitch filter parameters from the received transmission
codes according to:

$$\hat{L}^* = \frac{PLAGj}{2}$$

$$\hat{b}^* = \begin{cases} 0, & \hat{L}^* = 0 \\ \frac{2}{8}\,PGAINj, & \hat{L}^* \ne 0 \end{cases}$$

Pitch filter 710 then filters $G\,cb(n)$, where the filter has a transfer function given by

$$\frac{1}{P(z)} = \frac{1}{1 - \hat{b}^* z^{-\hat{L}^*}}$$

In a preferred embodiment, CELP decoder mode 206 also adds an extra pitch
filtering operation, a pitch prefilter (not shown), after pitch filter 710. The lag for the pitch
prefilter is the same as that of pitch filter 710, whereas its gain is preferably half of the pitch
gain up to a maximum of 0.5.

LPC synthesis filter 712 receives the reconstructed quantized residual signal
$\hat{r}(n)$ and outputs the synthesized speech signal $\hat{s}(n)$.

D. Filter Update Module 

Filter update module 706 synthesizes speech as described in the previous section
in order to update filter memories. Filter update module 706 receives the codebook
excitation parameters and the pitch filter parameters, generates an excitation signal $cb(n)$,
pitch filters $G\,cb(n)$, and then synthesizes $\hat{s}(n)$. By performing this synthesis at the
encoder, memories in the pitch filter and in the LPC synthesis filter are updated for use
when processing the following subframe.

VIII. Prototype Pitch Period (PPP) Coding Mode

Prototype pitch period (PPP) coding exploits the periodicity of a speech signal to
achieve lower bit rates than may be obtained using CELP coding. In general, PPP coding
involves extracting a representative period of the residual signal, referred to herein as the
prototype residual, and then using that prototype to construct earlier pitch periods in the
frame by interpolating between the prototype residual of the current frame and a similar
pitch period from the previous frame (i.e., the prototype residual if the last frame was PPP).
The effectiveness (in terms of lowered bit rate) of PPP coding depends, in part, on how
closely the current and previous prototype residuals resemble the intervening pitch periods.
For this reason, PPP coding is preferably applied to speech signals that exhibit relatively
high degrees of periodicity (e.g., voiced speech), referred to herein as quasi-periodic speech
signals.

FIG. 9 depicts a PPP encoder mode 204 and a PPP decoder mode 206 in further
detail. PPP encoder mode 204 includes an extraction module 904, a rotational correlator
906, an encoding codebook 908, and a filter update module 910. PPP encoder mode 204
receives the residual signal $r(n)$ and outputs an encoded speech signal $s_{enc}(n)$, which
preferably includes codebook parameters and rotational parameters. PPP decoder mode 206
includes a codebook decoder 912, a rotator 914, an adder 916, a period interpolator 920,
and a warping filter 918.

FIG. 10 is a flowchart 1000 depicting the steps of PPP coding, including encoding
and decoding. These steps are discussed along with the various components of PPP
encoder mode 204 and PPP decoder mode 206.




A. Extraction Module

In step 1002, extraction module 904 extracts a prototype residual $r_p(n)$ from the
residual signal $r(n)$. As described above in Section III.F., initial parameter calculation
module 202 employs an LPC analysis filter to compute $r(n)$ for each frame. In a preferred
embodiment, the LPC coefficients in this filter are perceptually weighted as described in
Section VII. The length of $r_p(n)$ is equal to the pitch lag $L$ computed by initial parameter
calculation module 202 during the last subframe in the current frame.

FIG. 11 is a flowchart depicting step 1002 in greater detail. PPP extraction module
904 preferably selects a pitch period as close to the end of the frame as possible, subject to
certain restrictions discussed below. FIG. 12 depicts an example of a residual signal
calculated based on quasi-periodic speech, including the current frame and the last
subframe from the previous frame.

In step 1102, a "cut-free region" is determined. The cut-free region defines a set of
samples in the residual which cannot be endpoints of the prototype residual. The cut-free
region ensures that high energy regions of the residual do not occur at the beginning or end
of the prototype (which could cause discontinuities in the output were it allowed to happen).
The absolute value of each of the final $L$ samples of $r(n)$ is calculated. The variable $P_s$ is
set equal to the time index of the sample with the largest absolute value, referred to herein
as the "pitch spike." For example, if the pitch spike occurred in the last sample of the final
$L$ samples, $P_s = L - 1$. In a preferred embodiment, the minimum sample of the cut-free
region, $CF_{min}$, is set to be $P_s - 6$ or $P_s - 0.25L$, whichever is smaller. The maximum of the
cut-free region, $CF_{max}$, is set to be $P_s + 6$ or $P_s + 0.25L$, whichever is larger.
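
Locating the pitch spike and the cut-free bounds can be sketched as follows. The function name is illustrative; the guard of 6 samples and the 0.25 fraction are readings of a partly damaged source and are exposed as parameters for that reason. The lag is assumed to be an integer number of samples here.

```python
def cut_free_region(residual, lag, guard=6, frac=0.25):
    """Return (ps, cf_min, cf_max) for the final `lag` samples of the residual.

    ps is the index, within the final `lag` samples, of the sample with the
    largest absolute value (the "pitch spike").
    """
    tail = residual[-lag:]
    ps = max(range(lag), key=lambda i: abs(tail[i]))
    cf_min = min(ps - guard, ps - frac * lag)   # whichever is smaller
    cf_max = max(ps + guard, ps + frac * lag)   # whichever is larger
    return ps, cf_min, cf_max
```

Note that the bounds may fall below 0 or reach the lag itself; those cases are exactly what the branch conditions of the prototype selection step distinguish.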

In step 1104, the prototype residual is selected by cutting $L$ samples from the
residual. The region chosen is as close as possible to the end of the frame, under the
constraint that the endpoints of the region cannot be within the cut-free region. The $L$
samples of the prototype residual are determined using the algorithm described in the
following pseudo-code:

    if (CF_min < 0) {
        for (i = 0 to L + CF_min - 1)  r_p(i) = r(i + 160 - L)
        for (i = CF_min to L - 1)      r_p(i) = r(i + 160 - 2L)
    }
    else if (CF_max >= L) {
        for (i = 0 to CF_min - 1)      r_p(i) = r(i + 160 - L)
        for (i = CF_min to L - 1)      r_p(i) = r(i + 160 - 2L)
    }
    else {
        for (i = 0 to L - 1)           r_p(i) = r(i + 160 - L)
    }

B. Rotational Correlator 

Referring back to FIG. 10, in step 1004, rotational correlator 906 calculates a set of
rotational parameters based on the current prototype residual, $r_p(n)$, and the prototype
residual from the previous frame, $r_{prev}(n)$. These parameters describe how $r_{prev}(n)$ can best
be rotated and scaled for use as a predictor of $r_p(n)$. In a preferred embodiment, the set of
rotational parameters includes an optimal rotation $R^*$ and an optimal gain $b^*$. FIG. 13 is
a flowchart depicting step 1004 in greater detail.

In step 1302, the perceptually weighted target signal $x(n)$ is computed by circularly
filtering the prototype pitch residual period $r_p(n)$. This is achieved as follows. A temporary
signal $tmp1(n)$ is created from $r_p(n)$ as

$$tmp1(n) = \begin{cases} r_p(n), & 0 \le n < L \\ 0, & L \le n < 2L \end{cases}$$

which is filtered by the weighted LPC synthesis filter with zero memories to provide an
output $tmp2(n)$. In a preferred embodiment, the LPC coefficients used are the perceptually
weighted coefficients corresponding to the last subframe in the current frame. The target
signal $x(n)$ is then given by

$$x(n) = tmp2(n) + tmp2(n + L), \quad 0 \le n < L$$
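
Circular filtering by zero-padding and folding can be sketched as follows. This is a simplified illustration: the all-pole filter uses the predictor convention y(n) = x(n) + sum_i a_i y(n-i), the coefficients are assumed to be already perceptually weighted, and both function names are hypothetical.

```python
def synthesis_filter(signal, a):
    """All-pole filter with zero initial memory.

    a : predictor coefficients, a[0] is a_1
    """
    y = []
    for n, x in enumerate(signal):
        acc = x
        for i, coeff in enumerate(a, start=1):
            if n - i >= 0:
                acc += coeff * y[n - i]
        y.append(acc)
    return y


def circular_filter(rp, a):
    """Filter one pitch period so that the filter tail wraps around.

    The period is padded with L zeros, filtered over 2L samples, and the
    second half (the tail) is added back onto the first half.
    """
    L = len(rp)
    tmp1 = list(rp) + [0.0] * L
    tmp2 = synthesis_filter(tmp1, a)
    return [tmp2[n] + tmp2[n + L] for n in range(L)]
```

Folding only one extra period back is an approximation of true circular convolution; it is adequate when the filter response has mostly decayed after 2L samples.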

In step 1304, the prototype residual from the previous frame, $r_{prev}(n)$, is extracted
from the previous frame's quantized formant residual (which is also in the pitch filter's
memories). The previous prototype residual is preferably defined as the last $L_p$ values of
the previous frame's formant residual, where $L_p$ is equal to $L$ if the previous frame was not
a PPP frame, and is set to the previous pitch lag otherwise.

In step 1306, the length of $r_{prev}(n)$ is altered to be of the same length as $x(n)$ so that
correlations can be correctly computed. This technique for altering the length of a sampled
signal is referred to herein as warping. The warped pitch excitation signal, $rw_{prev}(n)$, may
be described as

$$rw_{prev}(n) = r_{prev}(n \cdot TWF), \quad 0 \le n < L$$

where $TWF$ is the time warping factor $L_p / L$. The sample values at non-integral points
$n \cdot TWF$ are preferably computed using a set of sinc function tables. The sinc sequence chosen
is $\mathrm{sinc}(-3 - F : 4 - F)$ where $F$ is the fractional part of $n \cdot TWF$ rounded to the nearest
multiple of 1/8. The beginning of this sequence is aligned with $r_{prev}((N - 3) \% L_p)$ where $N$ is
the integral part of $n \cdot TWF$ after being rounded to the nearest eighth.

In step 1308, the warped pitch excitation signal $rw_{prev}(n)$ is circularly filtered,
resulting in $y(n)$. This operation is the same as that described above with respect to step
1302, but applied to $rw_{prev}(n)$.



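In step 1310, the pitch rotation search range is computed by first calculating an
expected rotation $E_{rot}$:

$$E_{rot} = L - \mathrm{round}\left( L\, \mathrm{frac}\left( \frac{(160 - L)(L_p + L)}{2 L_p L} \right) \right)$$

where $\mathrm{frac}(x)$ gives the fractional part of $x$. If $L < 80$, the pitch rotation search range is
defined to be $\{E_{rot} - 8, E_{rot} - 7.5, \ldots, E_{rot} + 7.5\}$, and $\{E_{rot} - 16, E_{rot} - 15, \ldots, E_{rot} + 15\}$ where
$L \ge 80$.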

In step 1312, the rotational parameters, optimal rotation $R^*$ and an optimal gain $b^*$,
are calculated. The pitch rotation which results in the best prediction between $x(n)$ and $y(n)$
is chosen along with the corresponding gain $b$. These parameters are preferably chosen to
minimize the error signal $e(n) = x(n) - y(n)$. The optimal rotation $R^*$ and the optimal gain
$b^*$ are those values of rotation $R$ and gain $b$ which result in the maximum value of $Exy_R^2 / Eyy$,
where $Exy_R = \sum_{i=0}^{L-1} x((i + R) \% L)\, y(i)$ and $Eyy = \sum_{i=0}^{L-1} y(i)\, y(i)$, for which the optimal gain
$b^*$ is $Exy_{R^*} / Eyy$ at rotation $R^*$. For fractional values of rotation, the value of $Exy_R$ is
approximated by interpolating the values of $Exy_R$ computed at integer values of rotation.
A simple four tap interpolation filter is used. For example,

$$Exy_R = 0.54\,(Exy_{R'} + Exy_{R'+1}) - 0.04\,(Exy_{R'-1} + Exy_{R'+2})$$

where $R$ is a non-integral rotation (with precision of 0.5) and $R' = \lfloor R \rfloor$.
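
A search over integer rotations can be sketched as below. It is a simplification: every rotation from 0 to L-1 is tried rather than only the window around the expected rotation, half-sample rotations are not interpolated, and the function name is illustrative.

```python
def best_rotation(x, y):
    """Return (rotation, gain) maximizing Exy_R^2 / Eyy over integer rotations.

    x : perceptually weighted target, length L
    y : circularly filtered, warped previous prototype, length L
    """
    L = len(x)
    eyy = sum(v * v for v in y)
    if eyy == 0:
        raise ValueError("previous prototype has zero energy")
    best_r, best_exy = 0, None
    for r in range(L):
        exy = sum(x[(i + r) % L] * y[i] for i in range(L))
        # Eyy does not depend on the rotation, so comparing Exy^2 suffices.
        if best_exy is None or exy * exy > best_exy * best_exy:
            best_r, best_exy = r, exy
    return best_r, best_exy / eyy
```

If the target is simply the previous prototype shifted circularly by one sample, the search returns a rotation of 1 with a gain of 1.0.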

In a preferred embodiment, the rotational parameters are quantized for efficient
transmission. The optimal gain $b^*$ is preferably quantized uniformly between 0.0625 and
4.0 as

$$PGAIN = \max\left\{ \min\left( \left\lfloor 63 \left( \frac{b^* - 0.0625}{4 - 0.0625} \right) + 0.5 \right\rfloor,\ 63 \right),\ 0 \right\}$$

where PGAIN is the transmission code and the quantized gain $\hat{b}^*$ is given by
$\max\{0.0625 + PGAIN\,(4 - 0.0625)/63,\ 0.0625\}$. The optimal rotation $R^*$ is quantized as the
transmission code PROT, which is set to $2(R^* - E_{rot} + 8)$ if $L < 80$, and $R^* - E_{rot} + 16$
where $L \ge 80$.
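
The uniform gain quantizer and the rotation code, together with their inverses, can be sketched as follows. The function names are illustrative, and the rotation is assumed to lie on the search grid already (half-sample steps for L < 80, integer steps otherwise).

```python
import math

GAIN_MIN, GAIN_MAX, GAIN_LEVELS = 0.0625, 4.0, 63


def quantize_gain(b):
    """Map a gain to the 6-bit code PGAIN in 0..63."""
    code = math.floor(GAIN_LEVELS * (b - GAIN_MIN) / (GAIN_MAX - GAIN_MIN) + 0.5)
    return max(min(code, GAIN_LEVELS), 0)


def dequantize_gain(pgain):
    """Recover the quantized gain from PGAIN."""
    return max(GAIN_MIN + pgain * (GAIN_MAX - GAIN_MIN) / GAIN_LEVELS, GAIN_MIN)


def quantize_rotation(r, e_rot, lag):
    """Map a rotation to the code PROT, relative to the expected rotation."""
    if lag < 80:
        return int(round(2 * (r - e_rot + 8)))
    return int(round(r - e_rot + 16))


def dequantize_rotation(prot, e_rot, lag):
    """Recover the rotation from PROT."""
    if lag < 80:
        return prot / 2 + e_rot - 8
    return prot + e_rot - 16
```

Coding the rotation as an offset from the expected rotation keeps the code within 0..31 in both cases, since the search range spans 32 candidates either way.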

C. Encoding Codebook

Referring back to FIG. 10, in step 1006, encoding codebook 908 generates a set of
codebook parameters based on the received target signal $x(n)$. Encoding codebook 908
seeks to find one or more codevectors which, when scaled, added, and filtered sum to a
signal which approximates $x(n)$. In a preferred embodiment, encoding codebook 908 is
implemented as a multi-stage codebook, preferably three stages, where each stage produces
a scaled codevector. The set of codebook parameters therefore includes the indexes and
gains corresponding to three codevectors. FIG. 14 is a flowchart depicting step 1006 in
greater detail.

In step 1402, before the codebook search is performed, the target signal $x(n)$ is
updated as

$$x(n) = x(n) - b\,y((n - R^*) \% L), \quad 0 \le n < L$$

If in the above subtraction the rotation $R^*$ is non-integral (i.e., has a fraction of 0.5),
then

$$y(i - 0.5) = -0.0073\,(y(i-4) + y(i+3)) + 0.0322\,(y(i-3) + y(i+2)) - 0.1363\,(y(i-2) + y(i+1)) + 0.6076\,(y(i-1) + y(i))$$

where $i = n - \lfloor R^* \rfloor$.

In step 1404, the codebook values are partitioned into multiple regions. According
to a preferred embodiment, the codebook is determined as

$$c(n) = \begin{cases} 1, & n = 0 \\ 0, & 0 < n < L \\ CBP(n - L), & L \le n < 128 + L \end{cases}$$

where CBP are the values of a stochastic or trained codebook. Those skilled in the art will
recognize how these codebook values are generated. The codebook is partitioned into
multiple regions, each of length $L$. The first region is a single pulse, and the remaining
regions are made up of values from the stochastic or trained codebook. The number of
regions $N$ will be $\lceil 128/L \rceil$.

In step 1406, the multiple regions of the codebook are each circularly filtered to
produce the filtered codebooks, $y_{reg}(n)$, the concatenation of which is the signal $y(n)$. For
each region, the circular filtering is performed as described above with respect to step 1302.

In step 1408, the filtered codebook energy, $Eyy(reg)$, is computed for each region
and stored:

$$Eyy(reg) = \sum_{i=0}^{L-1} y_{reg}(i)^2, \quad 0 \le reg < N$$

In step 1410, the codebook parameters (i.e., codevector index and gain) for each
stage of the multi-stage codebook are computed. According to a preferred embodiment, let
Region(I) = reg, defined as the region in which sample I resides, or

$$Region(I) = \begin{cases} 0, & 0 \le I < L \\ 1, & L \le I < 2L \\ 2, & 2L \le I < 3L \\ \vdots \end{cases}$$

and let Exy(I) be defined as

$$Exy(I) = \sum_{i=0}^{L-1} x(i)\, y_{Region(I)}((i + I) \% L)$$

The codebook parameters, $I^*$ and $G^*$, for the $j$-th codebook stage are computed using
the following pseudo-code.

    Exy* = 0, Eyy* = 0
    for (I = 0 to 127) {
        compute Exy(I)
        if (Exy(I) sqrt(Eyy*) > Exy* sqrt(Eyy(Region(I)))) {
            Exy* = Exy(I)
            Eyy* = Eyy(Region(I))
            I* = I
        }
    }

and $G^* = Exy^* / Eyy^*$.

According to a preferred embodiment, the codebook parameters are quantized for
efficient transmission. The transmission code CBIj (j = stage number: 0, 1 or 2) is
preferably set to $I^*$ and the transmission codes CBGj and SIGNj are set by quantizing the
gain $G^*$.



15 



SIGNy={?;|:;» 
CBGjJ



min{max{0.1og,(|G *|)}, 1 1.25} j + 0.5 



and the quantized gain G * is 



The target signal x(n) is then updated by subtracting the contribution of the
codebook vector of the current stage:

$$x(n) = x(n) - \hat{G}^*\, y_{Region(I^*)}\big((n + I^*)\%L\big), \quad 0 \le n < L$$

The above procedures, starting from the pseudo-code, are repeated to compute I*, G*,
and the corresponding transmission codes for the second and third stages.
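The gain quantization and target update for one stage might be sketched as follows. This is illustrative only: the function names are hypothetical, and the base-2 logarithm, the 4/3 scale factor (which maps the 11.25 clamp onto a 4-bit code, 0 to 15), and the sign convention are assumptions about the quantizer, since those details are hard to read in the source.

```python
import math


def quantize_gain(g):
    """Return (CBG, SIGN) for a stage gain g."""
    sign = 0 if g >= 0 else 1
    mag = abs(g)
    log_mag = math.log2(mag) if mag > 0 else 0.0  # guard against log2(0)
    log_mag = min(max(0.0, log_mag), 11.25)
    cbg = int(math.floor(4.0 * log_mag / 3.0 + 0.5))  # assumed 4/3 scaling
    return cbg, sign


def dequantize_gain(cbg, sign):
    """Inverse mapping under the same assumptions."""
    mag = 2.0 ** (0.75 * cbg)
    return mag if sign == 0 else -mag


def update_target(x, y_region, index, gain):
    """x(n) <- x(n) - gain * y_region((n + index) % L)."""
    L = len(x)
    return [x[n] - gain * y_region[(n + index) % L] for n in range(L)]
```

Running the three stages then amounts to calling the stage search, quantizing its gain, and passing the updated target to the next stage.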

D. Filter Update Module 

Referring back to FIG. 10, in step 1008, filter update module 910 updates the filters
used by PPP encoder mode 204. Two alternative embodiments are presented for filter
update module 910, as shown in FIGs. 15A and 16A. As shown in the first alternative
embodiment in FIG. 15A, filter update module 910 includes a decoding codebook 1502,
a rotator 1504, a warping filter 1506, an adder 1510, an alignment and interpolation module
1508, an update pitch filter module 1512, and an LPC synthesis filter 1514. The second
embodiment, as shown in FIG. 16A, includes a decoding codebook 1602, a rotator 1604,
a warping filter 1606, an adder 1608, an update pitch filter module 1610, a circular LPC
synthesis filter 1612, and an update LPC filter module 1614. FIGs. 17 and 18 are
flowcharts depicting step 1008 in greater detail, according to the two embodiments.

In step 1702 (and 1802, the first step of both embodiments), the current
reconstructed prototype residual, $r_{curr}(n)$, L samples in length, is reconstructed from the
codebook parameters and rotational parameters. In a preferred embodiment, rotator 1504
(and 1604) rotates a warped version of the previous prototype residual according to the
following:

$$r_{curr}\big((n + R)\%L\big) = b\, r_{wprev}(n), \quad 0 \le n < L$$

where $r_{curr}$ is the current prototype to be created, $r_{wprev}$ is the warped (as described
above in Section VIII.A., with $TWF = L_p / L$) version of the previous period obtained from
the most recent L samples of the pitch filter memories, and b, the pitch gain, and R, the
rotation, are obtained from packet transmission codes as

$$b = \max\{\,\text{[term illegible in source]},\ 0.0625\,\}$$

$$R = \begin{cases} \dfrac{PROT}{2} + E_{rot} - 8, & L < 80 \\ PROT + E_{rot} - 16, & L \ge 80 \end{cases}$$

where $E_{rot}$ is the expected rotation computed as described above in Section VIII.B.

Decoding codebook 1502 (and 1602) adds the contributions for each of the three
codebook stages to $r_{curr}(n)$ as

$$r_{curr}\big((n - I)\%L\big) = r_{curr}\big((n - I)\%L\big) + \begin{cases} G, & I < L,\ n = 0 \\ \text{[remaining case illegible in source]} \end{cases}$$

where I = CBIj and G is obtained from CBGj and SIGNj as described in the previous section,
j being the stage number.

At this point, the two alternative embodiments for filter update module 910 differ.
Referring first to the embodiment of FIG. 15A, in step 1704, alignment and interpolation
module 1508 fills in the remainder of the residual samples from the beginning of the current
frame to the beginning of the current prototype residual (as shown in FIG. 12). Here, the
alignment and interpolation are performed on the residual signal. However, these same
operations can also be performed on speech signals, as described below. FIG. 19 is a
flowchart describing step 1704 in further detail.

In step 1902, it is determined whether the previous lag $L_p$ is a double or a half
relative to the current lag L. In a preferred embodiment, other multiples are considered too
improbable, and are therefore not considered. If $L_p > 1.85L$, $L_p$ is halved and only the
first half of the previous period $r_{prev}(n)$ is used. If $L_p < 0.54L$, the current lag L is
likely a double and consequently $L_p$ is also doubled and the previous period $r_{prev}(n)$ is
extended by repetition.
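The double/half check of step 1902 could look like the following Python sketch. The function name is hypothetical, and the sketch assumes an integer previous lag equal to the length of the stored previous period; fractional lags are outside its scope.

```python
def match_previous_period(r_prev, L):
    """Return (adjusted previous period, adjusted previous lag)."""
    L_p = len(r_prev)
    if L_p > 1.85 * L:
        # Previous lag looks like a double: keep only the first half.
        L_p //= 2
        return list(r_prev[:L_p]), L_p
    if L_p < 0.54 * L:
        # Current lag looks like a double: extend by repetition.
        return list(r_prev) + list(r_prev), 2 * L_p
    return list(r_prev), L_p
```

Ratios between 0.54 and 1.85 leave the previous period untouched, which matches the statement that other multiples are not considered.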



In step 1904, $r_{prev}(n)$ is warped to form $r_{wprev}(n)$ as described above with respect to
step 1306, with $TWF = L_p / L$, so that the lengths of both prototype residuals are now the
same. Note that this operation was performed in step 1702, as described above, by warping
filter 1506. Those skilled in the art will recognize that step 1904 would be unnecessary if
the output of warping filter 1506 were made available to alignment and interpolation
module 1508.

In step 1906, the allowable range of alignment rotations is computed. The expected
alignment rotation, $E_A$, is computed to be the same as $E_{rot}$ as described above in Section
VIII.B. The alignment rotation search range is defined to be
$\{E_A - \delta A,\ E_A - \delta A + 0.5,\ E_A - \delta A + 1,\ \ldots,\ E_A + \delta A - 1.5,\ E_A + \delta A - 1\}$,
where $\delta A = \max\{6, 0.15L\}$.
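As a rough illustration, the search range of step 1906 can be enumerated in half-sample steps. The function name is hypothetical; the end point E_A + δA - 1 is taken from the range as stated above, and the small epsilon only protects the step count against floating-point rounding.

```python
def alignment_search_range(e_a, L):
    """List candidate rotations from E_A - dA to E_A + dA - 1 in steps of 0.5."""
    delta = max(6.0, 0.15 * L)
    start = e_a - delta
    stop = e_a + delta - 1.0
    count = int((stop - start) / 0.5 + 1e-9) + 1
    return [start + 0.5 * k for k in range(count)]
```

For L = 40 the half-width is 6, so the range holds 23 candidates around the expected rotation.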

In step 1908, the cross-correlations between the previous and current prototype
periods for integer alignment rotations, A, are computed as

$$C(A) = \sum_{i=0}^{L-1} r_{curr}\big((i + A)\%L\big)\, r_{wprev}(i)$$

and the cross-correlations for non-integral rotations A are approximated by interpolating the
values of the correlations at integral rotations:

$$C(A) = 0.54\,\big(C(A') + C(A' + 1)\big) - 0.04\,\big(C(A' - 1) + C(A' + 2)\big)$$

where $A' = A - 0.5$.

In step 1910, the value of A (over the range of allowable rotations) which results in
the maximum value of C(A) is chosen as the optimal alignment, A*.
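Steps 1908 and 1910 can be sketched in Python as below. The names are hypothetical, and the direction of the circular shift in the integer correlation is an assumption chosen to agree with the index nα + A* used in the later interpolation step. Only integer and half-integer candidates are handled, as in the text.

```python
def correlation(r_curr, r_wprev, a):
    """C(A) for an integer or half-integer rotation a."""
    L = len(r_curr)

    def c_int(k):
        # Circular cross-correlation at integer rotation k.
        return sum(r_curr[(i + k) % L] * r_wprev[i] for i in range(L))

    if float(a).is_integer():
        return c_int(int(a))
    k = int(round(a - 0.5))  # A' = A - 0.5
    return (0.54 * (c_int(k) + c_int(k + 1))
            - 0.04 * (c_int(k - 1) + c_int(k + 2)))


def best_alignment(r_curr, r_wprev, candidates):
    """Return the first candidate rotation that maximises C(A)."""
    return max(candidates, key=lambda a: correlation(r_curr, r_wprev, a))
```

With a single-pulse previous prototype and a current prototype whose pulse sits three samples later, the sketch selects a rotation of 3.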

In step 1912, the average lag or pitch period for the intermediate samples, $L_{av}$, is
computed in the following manner. A period number estimate, $N_{per}$, is computed as

$$N_{per} = \mathrm{round}\!\left(\frac{A^*}{L} + \frac{(160 - L)(L_p + L)}{2 L_p L}\right)$$

with the average lag for the intermediate samples given by

$$L_{av} = \frac{(160 - L)\,L}{N_{per} L - A^*}$$

In step 1914, the remaining residual samples in the current frame are calculated
according to the following interpolation between the previous and current prototype
residuals:

$$\hat{r}(n) = \begin{cases} \left(1 - \dfrac{n}{160 - L}\right) r_{wprev}\big((n\alpha)\%L\big) + \dfrac{n}{160 - L}\, r_{curr}\big((n\alpha + A^*)\%L\big), & 0 \le n < 160 - L \\ r_{curr}(n + L - 160), & 160 - L \le n < 160 \end{cases}$$

where $\alpha = L / L_{av}$ is the ratio of the current lag to the average lag. The sample values
at non-integral points $\tilde{n}$ (equal to either $n\alpha$ or $n\alpha + A^*$) are computed using a set of
sinc function tables. The sinc sequence chosen is sinc(-3 - F : 4 - F), where F is the
fractional part of $\tilde{n}$ rounded to the nearest multiple of 1/8. The beginning of this
sequence is aligned with $r_{prev}\big((N - 3)\%L_p\big)$, where N is the integral part of $\tilde{n}$
after being rounded to the nearest eighth.
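The cross-fade of step 1914 can be illustrated with the Python sketch below. It is deliberately simplified: circular linear interpolation is swapped in for the sinc tables described above, the function names are hypothetical, and `alpha` and `a_star` are passed in rather than derived from the average lag.

```python
import math


def frac_sample(proto, pos):
    """Circular linear interpolation (a stand-in for the sinc tables)."""
    L = len(proto)
    pos %= L
    k = int(math.floor(pos))
    f = pos - k
    return (1.0 - f) * proto[k % L] + f * proto[(k + 1) % L]


def interpolate_frame(r_wprev, r_curr, alpha, a_star, frame_len=160):
    """Rebuild one frame of residual from two prototypes of length L."""
    L = len(r_curr)
    gap = frame_len - L
    out = []
    for n in range(gap):
        w = n / gap  # weight moves from the previous to the current prototype
        out.append((1.0 - w) * frac_sample(r_wprev, n * alpha)
                   + w * frac_sample(r_curr, n * alpha + a_star))
    out.extend(r_curr)  # last L samples are the current prototype itself
    return out
```

The weight on the previous prototype falls linearly from 1 to 0 across the first 160 - L samples, after which the current prototype is copied unchanged.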

Note that this operation is essentially the same as warping, as described above with
respect to step 1306. Therefore, in an alternative embodiment, the interpolation of step
1914 is computed using a warping filter. Those skilled in the art will recognize that
economies might be realized by reusing a single warping filter for the various purposes
described herein.

Returning to FIG. 17, in step 1706, update pitch filter module 1512 copies values
from the reconstructed residual $\hat{r}(n)$ to the pitch filter memories. Likewise, the memories
of the pitch prefilter are also updated.

In step 1708, LPC synthesis filter 1514 filters the reconstructed residual $\hat{r}(n)$,
which has the effect of updating the memories of the LPC synthesis filter.

The second embodiment of filter update module 910, as shown in FIG. 16A, is now
described. As described above with respect to step 1702, in step 1802, the prototype
residual is reconstructed from the codebook and rotational parameters, resulting in $r_{curr}(n)$.

In step 1804, update pitch filter module 1610 updates the pitch filter memories by
copying replicas of the L samples from $r_{curr}(n)$, according to

$$pitch\_mem(i) = r_{curr}\big((L - (131\%L) + i)\%L\big), \quad 0 \le i < 131$$

or alternatively,

$$pitch\_mem(131 - 1 - i) = r_{curr}\big(L - 1 - i\%L\big), \quad 0 \le i < 131$$

where 131 is preferably the pitch filter order for a maximum lag of 127.5. In a preferred
embodiment, the memories of the pitch prefilter are identically replaced by replicas of the
current period $r_{curr}(n)$:

$$pitch\_prefilt\_mem(i) = pitch\_mem(i), \quad 0 \le i < 131$$
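The two index formulas for the pitch filter memory describe the same fill pattern: the memory ends with a complete copy of the current prototype and repeats it backwards from there. A small Python sketch, with hypothetical function names and the formulas read as given above, makes the equivalence easy to check.

```python
def fill_pitch_memory(r_curr, order=131):
    """First form: pitch_mem(i) = r_curr((L - (order % L) + i) % L)."""
    L = len(r_curr)
    return [r_curr[(L - (order % L) + i) % L] for i in range(order)]


def fill_pitch_memory_alt(r_curr, order=131):
    """Second form: pitch_mem(order - 1 - i) = r_curr(L - 1 - (i % L))."""
    L = len(r_curr)
    mem = [0.0] * order
    for i in range(order):
        mem[order - 1 - i] = r_curr[L - 1 - (i % L)]
    return mem
```

The pitch prefilter memory would then simply be a copy of the returned list.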

In step 1806, $r_{curr}(n)$ is circularly filtered as described in Section VIII.B., resulting
in $s_c(n)$, preferably using perceptually weighted LPC coefficients.

In step 1808, values from $s_c(n)$, preferably the last ten values (for a 10th order LPC
filter), are used to update the memories of the LPC synthesis filter.

E. PPP Decoder 

Returning to FIGs. 9 and 10, in step 1010, PPP decoder mode 206 reconstructs the
prototype residual $r_{curr}(n)$ based on the received codebook and rotational parameters.
Decoding codebook 912, rotator 914, and warping filter 918 operate in the manner
described in the previous section. Period interpolator 920 receives the reconstructed
prototype residual $r_{curr}(n)$ and the previous reconstructed prototype residual $r_{prev}(n)$,
interpolates the samples between the two prototypes, and outputs synthesized speech signal
s(n). Period interpolator 920 is described in the following section.

F. Period Interpolator 

In step 1012, period interpolator 920 receives $r_{curr}(n)$ and outputs synthesized speech
signal s(n). Two alternative embodiments for period interpolator 920 are presented
herein, as shown in FIGs. 15B and 16B. In the first alternative embodiment, FIG. 15B,
period interpolator 920 includes an alignment and interpolation module 1516, an LPC
synthesis filter 1518, and an update pitch filter module 1520. The second alternative
embodiment, as shown in FIG. 16B, includes a circular LPC synthesis filter 1616, an
alignment and interpolation module 1618, an update pitch filter module 1622, and an
update LPC filter module 1620. FIGs. 20 and 21 are flowcharts depicting step 1012 in
greater detail, according to the two embodiments.

Referring to FIG. 15B, in step 2002, alignment and interpolation module 1516
reconstructs the residual signal for the samples between the current residual prototype
$r_{curr}(n)$ and the previous residual prototype $r_{prev}(n)$, forming $\hat{r}(n)$. Alignment and
interpolation module 1516 operates in the manner described above with respect to step
1704 (as shown in FIG. 19).

In step 2004, update pitch filter module 1520 updates the pitch filter memories
based on the reconstructed residual signal $\hat{r}(n)$, as described above with respect to step
1706.

In step 2006, LPC synthesis filter 1518 synthesizes the output speech signal s(n)
based on the reconstructed residual signal $\hat{r}(n)$. The LPC filter memories are
automatically updated when this operation is performed.
Referring now to FIGs. 16B and 21, in step 2102, update pitch filter module 1622
updates the pitch filter memories based on the reconstructed current residual prototype,
$r_{curr}(n)$, as described above with respect to step 1804.

In step 2104, circular LPC synthesis filter 1616 receives $r_{curr}(n)$ and synthesizes a
current speech prototype, $s_c(n)$ (which is L samples in length), as described above with
respect to step 1806.

In step 2106, update LPC filter module 1620 updates the LPC filter memories as
described above with respect to step 1808.

In step 2108, alignment and interpolation module 1618 reconstructs the speech
samples between the previous prototype period and the current prototype period. The
previous prototype residual, $r_{prev}(n)$, is circularly filtered (in an LPC synthesis configuration)
so that the interpolation may proceed in the speech domain. Alignment and interpolation
module 1618 operates in the manner described above with respect to step 1704 (see FIG.
19), except that the operations are performed on speech prototypes rather than residual
prototypes. The result of the alignment and interpolation is the synthesized speech signal
s(n).



IX. Noise Excited Linear Prediction (NELP) Coding Mode 

Noise Excited Linear Prediction (NELP) coding models the speech signal as a
pseudo-random noise sequence and thereby achieves lower bit rates than may be obtained
using either CELP or PPP coding. NELP coding operates most effectively, in terms of
signal reproduction, where the speech signal has little or no pitch structure, such as
unvoiced speech or background noise.

FIG. 22 depicts a NELP encoder mode 204 and a NELP decoder mode 206 in
further detail. NELP encoder mode 204 includes an energy estimator 2202 and an encoding
codebook 2204. NELP decoder mode 206 includes a decoding codebook 2206, a random
number generator 2210, a multiplier 2212, and an LPC synthesis filter 2208.

FIG. 23 is a flowchart 2300 depicting the steps of NELP coding, including encoding
and decoding. These steps are discussed along with the various components of NELP
encoder mode 204 and NELP decoder mode 206.

In step 2302, energy estimator 2202 calculates the energy of the residual signal for
each of the four subframes as

$$Esf_i = 0.5 \log_2\!\left(\frac{\sum_{n=40i}^{40i+39} r^2(n)}{40}\right), \quad 0 \le i < 4$$
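A minimal Python sketch of the subframe energy estimate follows. The function name and the `floor` argument are hypothetical, and the log-domain form (half the base-2 logarithm of the mean squared residual) is an assumption inferred from the decoder, which turns codebook entries back into gains as powers of two.

```python
import math


def subframe_energies(residual, num_subframes=4, sub_len=40, floor=1e-12):
    """Return one log-domain energy per subframe of the residual."""
    energies = []
    for i in range(num_subframes):
        seg = residual[sub_len * i: sub_len * (i + 1)]
        mean_sq = sum(v * v for v in seg) / sub_len
        # The floor only avoids log2(0) on an all-zero subframe.
        energies.append(0.5 * math.log2(max(mean_sq, floor)))
    return energies
```

Under this assumption a subframe whose samples all have magnitude 2 has energy 1, so the decoded gain 2^1 reproduces the RMS level.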



In step 2304, encoding codebook 2204 calculates a set of codebook parameters,
forming encoded speech signal $s_{enc}(n)$. In a preferred embodiment, the set of codebook
parameters includes a single parameter, index I0. Index I0 is set equal to the value of j
which minimizes

$$\sum_{i=0}^{3} \big(Esf_i - SFEQ(j, i)\big)^2, \quad \text{where } 0 \le j < 128$$

The codebook vectors, SFEQ, are used to quantize the subframe energies $Esf_i$ and include
a number of elements equal to the number of subframes within a frame (i.e., 4 in a preferred
embodiment). These codebook vectors are preferably created according to standard
techniques known to those skilled in the art for creating stochastic or trained codebooks.
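The index selection of step 2304 is a nearest-neighbour search over the energy codebook. A minimal Python sketch, assuming a squared-error distance and a hypothetical `sfeq` table stored as a list of rows (128 rows of 4 values in the preferred embodiment):

```python
def nearest_codevector(esf, sfeq):
    """Return the index j minimising sum_i (esf[i] - sfeq[j][i]) ** 2."""
    def dist(j):
        return sum((e - q) ** 2 for e, q in zip(esf, sfeq[j]))
    return min(range(len(sfeq)), key=dist)
```

The toy table used to exercise this sketch has three rows; a real table would be trained or stochastic, as the text notes.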

In step 2306, decoding codebook 2206 decodes the received codebook parameters.
In a preferred embodiment, the set of subframe gains $G_i$ is decoded according to:

$$G_i = 2^{SFEQ(I0,\, i)}$$

or

$$G_i = 2^{\,0.2\, SFEQ(I0,\, i) + 0.8 \log_2 G_{prev} - 2}$$

(where the previous frame was coded using a zero-rate coding scheme), where $0 \le i < 4$
and $G_{prev}$ is the codebook excitation gain corresponding to the last subframe of the
previous frame.

In step 2308, random number generator 2210 generates a unit variance random
vector nz(n). This random vector is scaled by the appropriate gain $G_i$ within each subframe
in step 2310, creating the excitation signal $G_i\,nz(n)$.

In step 2312, LPC synthesis filter 2208 filters the excitation signal $G_i\,nz(n)$ to form
the output speech signal, s(n).
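Steps 2308 through 2312 can be sketched in Python as follows. All names are hypothetical. The gains use the first decoding rule (a power of two of the codebook entry), the noise source is Python's standard Gaussian generator, and the all-pole filter uses the convention s(n) = e(n) + sum of a_k s(n - k), which is an assumption since the text does not fix the sign of the LPC coefficients.

```python
import random


def unit_variance_noise(n, seed=0):
    """Draw n samples from a zero-mean, unit-variance Gaussian."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def nelp_excitation(sfeq_row, noise, sub_len=40):
    """Scale each subframe of the noise by G_i = 2 ** SFEQ(I0, i)."""
    gains = [2.0 ** q for q in sfeq_row]
    return [gains[n // sub_len] * noise[n] for n in range(len(noise))]


def lpc_synthesis(excitation, a):
    """All-pole filter: s(n) = e(n) + sum_k a[k-1] * s(n - k)."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, coeff in enumerate(a, start=1):
            if n - k >= 0:
                acc += coeff * out[n - k]
        out.append(acc)
    return out
```

A decoder frame would chain the three calls: draw 160 noise samples, scale them with the decoded codebook row, and filter the result with the frame's LPC coefficients.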

In a preferred embodiment, a zero rate mode is also employed where the gain $G_i$ and
LPC parameters obtained from the most recent non-zero-rate NELP subframe are used for
each subframe in the current frame. Those skilled in the art will recognize that this zero
rate mode can effectively be used where multiple NELP frames occur in succession.



X. Conclusion 



While various embodiments of the present invention have been described above, it
should be understood that they have been presented by way of example only, and not
limitation. Thus, the breadth and scope of the present invention should not be limited by
any of the above-described exemplary embodiments, but should be defined only in
accordance with the following claims and their equivalents.

The previous description of the preferred embodiments is provided to enable any
person skilled in the art to make or use the present invention. While the invention has been
particularly shown and described with reference to preferred embodiments thereof, it will
be understood by those skilled in the art that various changes in form and details may be
made therein without departing from the spirit and scope of the invention.

WHAT IS CLAIMED IS:

1. A method for the variable rate coding of a speech signal, comprising the steps of:

(a) classifying the speech signal as either active or inactive; 

(b) classifying said active speech into one of a plurality of types of 
active speech; 

(c) selecting a coding mode based on whether the speech signal is active 
or inactive, and if active, based further on said type of active speech; 
and 

(d) encoding the speech signal according to said coding mode, forming 
an encoded speech signal. 

2. The method of claim 1, further comprising the step of decoding said encoded speech
signal according to said coding mode, forming a synthesized speech signal.

3. The method of claim 1, wherein said coding mode comprises a CELP coding mode,
a PPP coding mode, or a NELP coding mode.

4. The method of claim 3, wherein said step of encoding encodes according to said 
coding mode at a predetermined bit rate associated with said coding mode. 

5. The method of claim 4, wherein said CELP coding mode is associated with a bit rate
of 8500 bits per second, said PPP coding mode is associated with a bit rate of 3900 bits per 
second, and said NELP coding mode is associated with a bit rate of 1550 bits per second. 

6. The method of claim 3, wherein said coding mode further comprises a zero rate 
mode. 

7. The method of claim 1, wherein said plurality of types of active speech include
voiced, unvoiced, and transient active speech.

8. The method of claim 7, wherein said step of selecting a coding mode comprises the
steps of:

(a) selecting a CELP mode if said speech is classified as active transient
speech;

(b) selecting a PPP mode if said speech is classified as active voiced speech;
and

(c) selecting a NELP mode if said speech is classified as inactive speech or
active unvoiced speech.

9. The method of claim 8, wherein said encoded speech signal comprises codebook
parameters and pitch filter parameters if said CELP mode is selected, codebook parameters
and rotational parameters if said PPP mode is selected, or codebook parameters if said
NELP mode is selected.



10. The method of claim 1, wherein said step of classifying speech as active or inactive
comprises a two energy band based thresholding scheme.

11. The method of claim 1, wherein said step of classifying speech as active or inactive
comprises the step of classifying the next M frames as active if the previous $N_{ho}$ frames
were classified as active.



12. The method of claim 1, further comprising the step of calculating initial parameters
using a "look ahead."

13. The method of claim 12, wherein said initial parameters comprise LPC coefficients.

14. The method of claim 1, wherein said coding mode comprises a NELP coding mode,
wherein the speech signal is represented by a residual signal generated by filtering the
speech signal with a Linear Predictive Coding (LPC) analysis filter, and wherein said step
of encoding comprises the steps of:

(i) estimating the energy of the residual signal, and 

(ii) selecting a codevector from a first codebook, 
wherein said codevector approximates said 
estimated energy; 

and wherein said step of decoding comprises the steps of: 

(i) generating a random vector, 

(ii) retrieving said codevector from a second codebook, 

(iii) scaling said random vector based on said codevector, such that the 
energy of said scaled random vector approximates said estimated 
energy, and 

(iv) filtering said scaled random vector with a LPC synthesis filter, 
wherein said filtered scaled random vector forms said synthesized 
speech signal. 

15. The method of claim 14, wherein the speech signal is divided into frames, wherein 
each of said frames comprises two or more subframes, wherein said step of estimating the 
energy comprises the step of estimating the energy of the residual signal for each of said 
subframes, and wherein said codevector comprises a value approximating said estimated
energy for each of said subframes. 

16. The method of claim 14, wherein said first codebook and said second codebook are 
stochastic codebooks. 

17. The method of claim 14, wherein said first codebook and said second codebook are 
trained codebooks. 



18. The method of claim 14, wherein said random vector comprises a unit variance
random vector.

19. A variable rate coding system for coding a speech signal, comprising:

classification means for classifying the speech signal as active or inactive, and if
active, for classifying the active speech as one of a plurality of types of active speech; and

a plurality of encoding means for encoding the speech signal as an encoded speech
signal, wherein said encoding means are dynamically selected to encode the speech signal
based on whether the speech signal is active or inactive, and if active, based further on said
type of active speech.

20. The system of claim 19, further comprising a plurality of decoding means for
decoding said encoded speech signal.

21. The system of claim 19, wherein said plurality of encoding means includes a CELP
encoding means, a PPP encoding means, and a NELP encoding means.

22. The system of claim 20, wherein said plurality of decoding means includes a CELP
decoding means, a PPP decoding means, and a NELP decoding means.

23. The system of claim 21, wherein each of said encoding means encodes at a 
predetermined bit rate. 

24. The system of claim 23, wherein said CELP encoding means encodes at a rate of
8500 bits per second, said PPP encoding means encodes at a rate of 3900 bits per second,
and said NELP encoding means encodes at a rate of 1550 bits per second.

25. The system of claim 21, wherein said plurality of encoding means further includes
a zero rate encoding means, and wherein said plurality of decoding means further includes
a zero rate decoding means.



26. The system of claim 19, wherein said plurality of types of active speech include
voiced, unvoiced, and transient active speech.

27. The system of claim 26, wherein said CELP encoder is selected if said speech is
classified as active transient speech, wherein said PPP encoder is selected if said speech is
classified as active voiced speech, and wherein said NELP encoder is selected if said speech
is classified as inactive speech or active unvoiced speech.

28. The system of claim 27, wherein said encoded speech signal comprises codebook
parameters and pitch filter parameters if said CELP encoder is selected, codebook 
parameters and rotational parameters if said PPP encoder is selected, or codebook 
parameters if said NELP encoder is selected. 

29. The system of claim 19, wherein said classification means classifies speech as active
or inactive based on a two energy band thresholding scheme. 

30. The system of claim 19, wherein said classification means classifies the next M
frames as active if the previous $N_{ho}$ frames were classified as active.

31. The system of claim 19, wherein the speech signal is represented by a residual signal
generated by filtering the speech signal with a Linear Predictive Coding (LPC) analysis 
filter, and wherein said plurality of encoding means includes a NELP encoding means 
comprising: 

energy estimator means for calculating an estimate of the energy of the residual 
signal, and 

encoding codebook means for selecting a codevector from a first codebook, wherein
said codevector approximates said estimated energy; 

and wherein said plurality of decoding means includes a NELP decoding means 
comprising: 

random number generator means for generating a random vector, 

decoding codebook means for retrieving said codevector from a second codebook, 

multiply means for scaling said random vector based on said codevector, such that 

the energy of said scaled random vector approximates said estimate, and 

means for filtering said scaled random vector with an LPC synthesis filter, wherein 

said filtered scaled random vector forms said synthesized speech signal.

32. The system of claim 19, wherein the speech signal is divided into frames, wherein
each of said frames comprises two or more subframes, wherein said energy estimator means
calculates an estimate of the energy of the residual signal for each of said subframes, and
wherein said codevector comprises a value approximating said subframe estimate for each
of said subframes.

33. The system of claim 19, wherein said first codebook and said second codebook are
stochastic codebooks.

34. The system of claim 19, wherein said first codebook and said second codebook are
trained codebooks.

35. The system of claim 19, wherein said random vector comprises a unit variance
random vector.